1.1 Subset the data into the specific dataset allocated
1.3 Data cleaning
# Installing 'ggplot2' - a data visualisation package.
#install.packages("ggplot2")
# Loading the 'ggplot2' library.
library(ggplot2)
# Installing 'dplyr'.
#install.packages("dplyr")
# Loading the 'dplyr' library.
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
# Installing 'VIM'.
#install.packages("VIM")
# Loading the 'VIM' library.
library(VIM)
Loading required package: colorspace
Loading required package: grid
Registered S3 method overwritten by 'data.table':
method from
print.data.table
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: ‘VIM’
The following object is masked from ‘package:datasets’:
sleep
# Installing 'validate'.
#install.packages("validate")
# Loading the 'validate' library.
library(validate)
Attaching package: ‘validate’
The following object is masked from ‘package:dplyr’:
expr
The following object is masked from ‘package:ggplot2’:
expr
# Installing 'tree'.
#install.packages("tree")
# Loading the 'tree' library.
library(tree)
# Installing 'mgcv'.
#install.packages("mgcv")
# Loading the 'mgcv' library.
library(mgcv)
Loading required package: nlme
Attaching package: ‘nlme’
The following object is masked from ‘package:dplyr’:
collapse
This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
# Installing 'tidyverse' - a core package which contains a set of functions designed to enable dataframe manipulation.
#install.packages("tidyverse")
# Loading the 'tidyverse' library.
#library(tidyverse)
When given a dataset as an rda object, it can be reloaded using the ‘load’ function. This can be done either by loading the dataset from the GitHub repository or by saving the dataset and loading it from the local drive. The R code for both methods is available below, with the code for the second method commented out.
The guidance given will be used to select two teams to create a subset of the data (which in this instance are the Boston Red Sox and Texas Rangers).
# Loading from GitHub repository
load(url("https://raw.githubusercontent.com/mjshepperd/CS5702-Data/master/CS5801_data.rda"))
# Loading the data file 'CS5801_data.rda' to be used in the coursework.
#load("CS5801_data.rda")
# Adding the dataset into the variable 'baseballData'.
baseballData <- CS5801.data
# Last two digits of student id: '00' (Student ID: 1736500).
# Following the guidance, add 1 to one of the digits since both are the same number.
# New set of digits to be used to determine the teams: '01'.
# Two teams corresponding to the two digits: BOS [Boston Red Sox] & TEX [Texas Rangers] ('0': BOS & '1': TEX).
# These teams are used to create a subset of data that is going to be organised, cleaned, analysed, and modelled in later sections of the coursework.
mySubset <- subset(baseballData, teamID.x=="BOS" | teamID.x=="TEX")
Data quality is a measure of the qualitative or quantitative aspects of a dataset based on factors such as accuracy, completeness, consistency, traceability, reliability, and timeliness (Rouse, 2019). A few common data quality checks include: mandatory fields (making sure these are not empty), data type (ensuring that the data in each attribute is of the type it needs to be), and range (ensuring that the data in a field falls within a range considered ‘normal’). When given a dataset, the first step is usually to assess the quality of the data and clean it if required (which almost always needs to be carried out). It should be noted that assessing the quality of the data and cleaning it is important, as it leads to better analysis and decision making. Further, ensuring that the data is accurate is sometimes a legal obligation, adding to the importance of this stage.
A comprehensive data quality checking plan includes:
In the case that any issues are found at either stage, they will be dealt with appropriately.
As part of visualising and understanding the dataset as a whole, a number of different functions can be used. These include:
Using all of these functions could also help confirm that the dataset has been loaded correctly.
# Visualising the dataset.
View(mySubset)
# Visualising the structure of the dataset.
str(mySubset)
'data.frame': 84 obs. of 15 variables:
$ playerID : chr "andruel01" "barnema01" "bassan01" "beltrad01" ...
$ teamID.x : Factor w/ 149 levels "ALT","ANA","ARI",..: 131 16 131 131 16 16 16 16 16 16 ...
$ G : int 160 32 33 143 145 156 74 45 18 80 ...
$ R : int 69 0 0 83 92 84 43 0 0 35 ...
$ H : int 154 0 0 163 174 196 55 0 0 69 ...
$ AB : int 596 0 0 567 597 613 221 0 6 273 ...
$ RBI : int 62 0 0 83 77 81 43 0 0 29 ...
$ weight : int 200 210 200 220 180 210 200 190 190 195 ...
$ height : int 72 76 74 71 69 73 70 72 75 69 ...
$ salary : num 15000000 508500 725000 16000000 514500 ...
$ birthDate : Date, format: "1988-08-26" "1990-06-17" "1987-11-01" "1979-04-07" ...
$ career.length: num 5.739 0.312 3.554 16.523 0.509 ...
$ bats : Factor w/ 3 levels "B","L","R": 3 3 3 3 3 3 2 2 2 3 ...
$ age : num 26.3 24.5 27.2 35.7 22.2 ...
$ hit.ind : num 1 0 0 1 1 1 1 0 0 1 ...
# Visualising the dimensions of the dataset.
dim(mySubset)
[1] 84 15
# Visualising the summary of the dataset.
summary(mySubset)
playerID teamID.x G R H AB RBI weight height
Length:84 BOS :42 Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. :-21.00 Min. :170.0 Min. :68.00
Class :character TEX :42 1st Qu.: 21.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:198.8 1st Qu.:71.00
Mode :character ALT : 0 Median : 33.00 Median : 1.50 Median : 2.00 Median : 5.5 Median : 0.00 Median :210.0 Median :73.00
ANA : 0 Mean : 52.21 Mean :17.01 Mean : 34.69 Mean :132.7 Mean : 16.56 Mean :215.6 Mean :73.11
ARI : 0 3rd Qu.: 75.00 3rd Qu.:26.50 3rd Qu.: 54.25 3rd Qu.:207.8 3rd Qu.: 25.50 3rd Qu.:225.0 3rd Qu.:75.00
ATL : 0 Max. :160.00 Max. :94.00 Max. :196.00 Max. :613.0 Max. :108.00 Max. :585.0 Max. :79.00
(Other): 0
salary birthDate career.length bats age hit.ind
Min. : 515 Min. :1975-04-03 Min. : 0.2902 B: 6 Min. :20.91 Min. :0.0000
1st Qu.: 514338 1st Qu.:1983-07-25 1st Qu.: 1.7447 L:25 1st Qu.:26.02 1st Qu.:0.0000
Median : 1250000 Median :1986-03-02 Median : 4.4873 R:53 Median :28.83 Median :1.0000
Mean : 5034773 Mean :1986-02-02 Mean : 4.9172 Mean :29.14 Mean :0.5476
3rd Qu.: 7800000 3rd Qu.:1988-12-24 3rd Qu.: 7.4394 3rd Qu.:31.65 3rd Qu.:1.0000
Max. :24000000 Max. :2002-06-17 Max. :17.3306 Max. :39.75 Max. :1.0000
# Visualising the attribute/column names of the dataset.
names(mySubset)
[1] "playerID" "teamID.x" "G" "R" "H" "AB" "RBI" "weight" "height"
[10] "salary" "birthDate" "career.length" "bats" "age" "hit.ind"
# Visualising the first six rows of the dataset.
head(mySubset)
# Visualising the last six rows of the dataset.
tail(mySubset)
Upon using all of these functions and visualising the dataset, a much better understanding of the type of data available was formed. Using the ‘summary’ function, it was found that some of the attributes do not have a data type that best corresponds to their purpose (it was decided that this would be examined at the individual attribute level and dealt with accordingly). Using the ‘names’, ‘head’, and ‘tail’ functions showed that some of the attribute names were not clear, especially to anyone unfamiliar with baseball terms. While a description of all of the terms is available in the appendix, having more meaningful column titles that follow a coherent pattern would be beneficial.
Data Quality Issue: column titles not meaningful
The next step of data quality checking would be to identify any missing data in the dataset – for which the ‘is.na’ function could be used:
# Checking for missingness in data - as a whole.
any(is.na(mySubset))
[1] FALSE
For this dataset, missing values were not found. However, if they were, each column could have been inspected separately (using the function below) to identify which values were missing.
# Checking for missingness of data in individual attributes/columns.
#is.na(mySubset)
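If missing values had been present, a per-column count would often locate them faster than scanning the raw logical matrix. A minimal sketch; the demo frame below is a hypothetical stand-in, and in the notebook the equivalent call would be colSums(is.na(mySubset)):

```r
# Hypothetical stand-in frame with some deliberately missing values.
demo <- data.frame(G = c(10, NA, 33), RBI = c(0, 5, NA))
# Count missing values in each column at once.
missingPerColumn <- colSums(is.na(demo))
missingPerColumn
```

This returns a named vector with one count per column, so columns needing attention can be read off directly.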
It is important to check for missingness of data in the form “” or “ ”, which can be done using the following functions.
# Checking for missingness in the form “”.
is.element("",unlist(mySubset))
[1] FALSE
# Checking for missingness in the form “ ”.
is.element(" ",unlist(mySubset))
[1] FALSE
Seeing as there is no missing data in this dataset, the next step of analysis could be carried out: visualising and investigating each attribute in the dataset individually.
This is a unique code given to each player in the team and therefore, should not contain any duplicates. However, it should be noted that one player id can appear across multiple teams in the overall dataset since it has been stated that one player can play for more than one team in a given season.
The ‘summary’ and ‘table’ functions can be used to visualise the playerID column of the dataset:
# Visualising summary of the playerID column.
summary(mySubset$playerID)
Length Class Mode
84 character character
# Looking at variables in the playerID column.
table(mySubset$playerID)
andruel01 barnema01 bassan01 beltrad01 bettsmo01 bogaexa01 bradlja02 breslcr01 brownmj01 buchhcl01 castiru01 cecchga01 chiriro01 choosh01 claudal01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
cogaera01 cookry01 corpoca01 craigal01 deazaal01 detwiro01 diekmja01 dysonsa01 edwarjo02 felizne01 fieldpr01 freemsa01 fujikky01 gallayo01 greeneb02
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
grokkel01 hamelco01 hamiljo03 hanigry01 hembrhe01 hollade01 holtbr01 jimenlu02 kellyjo05 kleinph01 layneto01 leonsa01 lewisco01 machije01 martile01
1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
martini01 masteju01 mendero01 mileywa01 morelmi01 mosslaf01 mujiced01 murphpr04 napolmi01 navada01 odorro01 ogandal01 ortizda01 pedrodu01 perezjj03
1 1 2 1 1 1 1 1 2 1 1 1 1 2 1
perezma02 porceri01 rabbibb01 ramirha01 rodriwa01 rosalad01 rossro01 ruary01 sandopa01 schepta01 smolija01 stubbdr01 tazawju01 tollesh01 ueharko01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
varvaan01 venabwi01 victosh01 wilsobo02 wrighst01
1 1 1 1 1
Next, the ‘duplicated’ function can be used to identify if there were any duplicate player IDs:
# Identifying duplicate variables in the Player ID column.
duplicated(mySubset$playerID)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[26] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[51] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[76] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Using the output of these functions, it was found that four player IDs (‘lewisco01’, ‘mendero01’, ‘napolmi01’, and ‘pedrodu01’) were duplicated. Upon further investigation, it was found that the players with IDs ‘mendero01’ and ‘napolmi01’ were members of both teams (BOS and TEX), which means that these two duplicate rows are not a data quality issue. However, the rows for players ‘lewisco01’ and ‘pedrodu01’ were genuine duplicates (with the data of lewisco01 duplicated in rows 481 & 959 and the data of pedrodu01 duplicated in rows 632 & 919); therefore, this was seen as an issue.
Data Quality Issue: rows with duplicate player ids
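One way to inspect the duplicated rows side by side is to pull out every row whose playerID occurs more than once. A minimal sketch using a hypothetical stand-in frame; in the notebook, demo would be replaced by mySubset:

```r
# Hypothetical stand-in frame with one repeated playerID.
demo <- data.frame(playerID = c("lewisco01", "barnema01", "lewisco01"),
                   G = c(27, 32, 27),
                   stringsAsFactors = FALSE)
# IDs that appear more than once...
dupIDs <- demo$playerID[duplicated(demo$playerID)]
# ...and every row carrying one of those IDs (both copies, not just the second).
dupRows <- demo[demo$playerID %in% dupIDs, ]
dupRows
```

Unlike duplicated() alone, this keeps the first occurrence as well, so the duplicate rows can be compared field by field.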
In terms of the datatype, considering the fact that the playerID contains a set of characters (letters and numbers) that are used to uniquely identify each player, the data type of ‘chr’ seems to be a perfect fit.
This is a unique code given to each team in the season. The full dataset contains information of all the different teams, however, the subset used in this coursework will only contain records of teams BOS and TEX.
The ‘summary’ function can be used to visualise the teamID.x column of the dataset (the ‘table’ function that has been commented out could also be used, but it would produce the same results as the ‘summary’ function):
# Visualising summary of the teamID.x column.
summary(mySubset$teamID.x)
BOS TEX ALT ANA ARI ATL BAL BFN BFP BL1 BL2 BL3 BL4 BLA BLF BLN BLU BR1 BR2
42 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BR3 BR4 BRF BRO BRP BS1 BS2 BSN BSP BSU BUF CAL CH1 CH2 CHA CHF CHN CHP CHU
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CIN CL1 CL2 CL3 CL4 CL5 CL6 CLE CLP CN1 CN2 CN3 CNU COL DET DTN ELI FLO FW1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HAR HOU HR1 IN1 IN2 IN3 IND KC1 KC2 KCA KCF KCN KCU KEO LAA LAN LS1 LS2 LS3
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MIA MID MIL MIN ML1 ML2 ML3 ML4 MLA MLU MON NEW NH1 NY1 NY2 NY3 NY4 NYA NYN
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NYP OAK PH1 PH2 (Other)
0 0 0 0 0
# Looking at variables in the teamID.x column.
#table(mySubset$teamID.x)
Upon looking at the summary of the Team IDs, it can be confirmed that only data from teams BOS and TEX are available in this subset. This was the expected result from this column of data, therefore, it can be stated that there does not seem to be an immediate problem with this attribute of the data.
The data type ‘fctr’ used for this column makes sense since it contains categorical data that is unordered.
This is a count of the number of games the respective player had played in.
# Visualising summary of the G column.
summary(mySubset$G)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 21.00 33.00 52.21 75.00 160.00
# Visualising the set of values in the R column.
list(mySubset$G)
[[1]]
[1] 160 32 33 143 145 156 74 45 18 80 2 78 149 18 5 33 36 60 17 26 31 11 18 158 54 2 33 12 50 54 22 10 129 1 25 11 64 41
[39] 33 26 95 24 18 12 3 32 132 11 35 98 29 120 64 146 93 14 28 105 17 55 54 28 126 42 35 27 61 73 43 9 37 33 31 16 93 22
[77] 28 16 23 33 94 84 70 82
Seeing as a baseball season generally consists of 162 games, the fact that the maximum number of games played by any player is 160 means that, on the surface, there is no problem with this data.
In terms of the data type of ‘Games’, ‘int’ makes the most sense as this is a count and will therefore only contain whole numbers.
This is a count of all the runs of a respective player.
# Visualising summary of the R column.
summary(mySubset$R)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 1.50 17.01 26.50 94.00
# Visualising the set of values in the R column.
list(mySubset$R)
[[1]]
[1] 69 0 0 83 92 84 43 0 0 35 0 33 94 0 0 10 6 23 0 0 0 0 0 78 0 0 0 0 22 28 0 0 56 0 0 0 0 8 0 0 26 2 0 0 0 0 51 0 9 37 6
[52] 54 0 73 46 0 0 59 1 14 0 10 43 0 12 6 0 0 0 0 6 10 5 0 46 0 9 24 6 0 36 22 30 22
The minimum and maximum values of Runs are both within the expected range and there was no problem identified upon visualising the set of values in this attribute. Therefore, it is reasonable to assume that these values are accurate and no further cleaning needs to be done.
In terms of the data type of ‘Runs’, ‘int’ makes the most sense as this is a count and therefore will only contain numbers that are whole numbers.
This is a count of all the hits (times the base was reached because of a batted, fair ball - without an error by the defence).
# Visualising summary of the H column.
summary(mySubset$H)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 2.00 34.69 54.25 196.00
# Visualising the set of values in the H column.
list(mySubset$H)
[[1]]
[1] 154 0 0 163 174 196 55 0 0 69 0 54 153 0 0 19 12 47 0 0 0 0 0 187 0 0 2 1 43 43 0 0 127 0 1 0 0 21
[39] 0 0 63 0 2 0 0 0 131 0 23 68 10 111 0 144 111 0 0 100 1 26 0 16 115 0 8 2 0 0 0 0 12 23 17 0 111 0
[77] 4 56 11 0 73 61 42 52
The minimum and maximum values of Hits are both within the expected range and there was no problem identified upon visualising the set of values in this attribute. Therefore, it is reasonable to assume that these values are accurate and no further cleaning needs to be done.
In terms of the data type of ‘Hits’, ‘int’ makes the most sense as this is a count and therefore will only contain numbers that are whole numbers.
This is the count of the number of times a batter reaches base via a fielder’s choice, hit or an error (not including catcher’s interference) or when a batter is put out on a non-sacrifice.
# Visualising summary of the AB column.
summary(mySubset$AB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 5.5 132.7 207.8 613.0
# Visualising the set of values in the AB column.
list(mySubset$AB)
[[1]]
[1] 596 0 0 567 597 613 221 0 6 273 4 233 555 0 0 107 79 161 0 0 0 0 0 613 0 0 4 3 170 174 0 0 454 1 5 0 0 114
[39] 2 0 288 4 3 0 0 2 471 0 78 329 66 426 0 528 381 2 2 401 2 114 0 83 470 0 60 21 0 0 0 0 66 94 77 1 381 0
[77] 97 213 31 2 337 206 184 173
The minimum and maximum values of At Bat are both within the expected range and there was no problem identified upon visualising the set of values in this attribute. Therefore, it is reasonable to assume that these values are accurate and no further cleaning needs to be done.
In terms of the data type of ‘At Bat’, ‘int’ makes the most sense as this is a count and therefore will only contain numbers that are whole numbers.
This is the total count of the runs batted in.
# Visualising summary of the RBI column.
summary(mySubset$RBI)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-21.00 0.00 0.00 16.56 25.50 108.00
On looking at the output from the summary of RBI, it is clear right away that there is a problem with the data. The RBI is a total count and therefore cannot be a negative number. The summary function shows that the minimum value in this column is -21 which is an error value. Therefore, a list of rows with negative numbers for the RBI value will need to be found.
# Finding the rows with a negative RBI values.
subset(mySubset,RBI<0)
Upon further investigation, it was found that there was another negative value in this column, meaning there are two error values in total (for player ids “grokkel01” & “perezjj03”).
Data Quality Issue: negative values in ‘runs batted in’
# Visualising the set of values in the RBI column.
list(mySubset$RBI)
[[1]]
[1] 62 0 0 83 77 81 43 0 0 29 0 34 82 0 0 15 3 25 0 0 0 0 0 98 0 0 0 0 25 16 0 0 45 0 1 0 0 3
[39] 0 0 25 0 1 0 0 0 85 0 10 40 7 61 0 108 42 0 0 53 0 7 0 7 47 0 6 0 0 0 0 0 3 4 10 0 42 0
[77] 5 25 -4 0 50 27 -21 29
Other than the two negative values, visualising the data in the Runs Batted In column showed that there were no further problems.
In terms of the data type of ‘Runs Batted In’, ‘int’ makes the most sense as this is a count and therefore will only contain numbers that are whole numbers.
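As a cross-check, the same non-negativity test can be applied to all of the count columns in one pass. A minimal sketch with hypothetical stand-in values; in the notebook, demo would be mySubset and countCols would be c("G", "R", "H", "AB", "RBI"):

```r
# Hypothetical stand-in frame with one negative count value.
demo <- data.frame(G = c(10, 33), R = c(0, 5), RBI = c(-4, 7))
countCols <- c("G", "R", "RBI")
# For each count column, how many values fall below zero.
negatives <- sapply(demo[countCols], function(col) sum(col < 0))
negatives
```

A column with a non-zero count here would immediately flag the same kind of error found in RBI.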
This is a measure of each player’s weight in pounds.
# Visualising summary of the weight column.
summary(mySubset$weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
170.0 198.8 210.0 215.6 225.0 585.0
Looking at the minimum and maximum weights, the minimum value of 170 lbs can be thought of as normal, but the maximum value of 585 lbs is well over the expected maximum weight of a person (NHS, 2020). Although NHS (UK) guidance was used to decide which weights would be considered ‘normal’, it can be assumed that these values apply to these (US) baseball players as well: even though average weights in different countries may differ, the extreme values are unlikely to differ significantly. Therefore, the range set out by the NHS can be used when searching for error values, and under this assumption the maximum value of 585 lbs can be treated as an error (i.e. an outlier).
Further, the median and the mean are close enough to suggest that the rest of the data in this attribute is accurate. As a check, the mean and the median can be recalculated ignoring the extreme value of 585 lbs.
# Finding the extreme values of weight.
subset(mySubset,weight>374)
Data Quality Issue: weight exceeds 374 lbs - which is deemed ‘too high’
# Calculating the mean weight of the subset without taking into account values outside the range (NHS, 2020).
mean(mySubset$weight[mySubset$weight<374])
[1] 211.1566
# Calculating the median weight of the subset without taking into account values outside the range (NHS, 2020).
median(mySubset$weight[mySubset$weight<374])
[1] 210
After removing the extreme value, the mean and the median are very close together, suggesting that the rest of the data is accurate.
In terms of the data type, given that weight is numerical and continuous, the given data type of ‘int’ might not be the ideal one. Instead, ‘dbl’ would be a more fitting data type for weight.
Data Quality Issue: datatype of weight should be continuous
This is a measure of each player’s height in inches.
# Visualising summary of the height column 58.2677 - 78.7402 (NHS, 2020).
summary(mySubset$height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
68.00 71.00 73.00 73.11 75.00 79.00
Looking at the maximum and minimum values, they seem to be within the range of ‘normal’ values (NHS, 2020). Further, the mean and the median of these values are very close (almost the same), which adds confidence that these values are accurate.
In terms of the data type, similar to the weight, the height is also numerical and continuous. Therefore, the given data type of ‘int’ might not be the ideal one but rather, ‘dbl’ would be a more fitting data type for the height of players.
Data Quality Issue: datatype of height should be continuous
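If the conversion to a continuous type were carried out, as.numeric() is one way to do it. A minimal sketch on a hypothetical stand-in frame; in the notebook, the same calls would apply to mySubset$weight and mySubset$height:

```r
# Hypothetical stand-in frame with integer weight and height columns.
demo <- data.frame(weight = c(200L, 210L), height = c(72L, 76L))
# Convert both columns from integer to double.
demo$weight <- as.numeric(demo$weight)
demo$height <- as.numeric(demo$height)
str(demo)
```

The stored values are unchanged; only the storage mode moves from ‘int’ to ‘dbl’.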
This is a measure of the salary of each player.
# Visualising summary of the salary column.
summary(mySubset$salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
515 514338 1250000 5034773 7800000 24000000
# Visualising the set of values in the salary column.
list(mySubset$salary)
[[1]]
[1] 1.500000e+07 5.085000e+05 7.250000e+05 1.600000e+07 5.145000e+05 5.430000e+05 5.280000e+05 2.000000e+06 1.200000e+07 1.127100e+07 5.085000e+05
[12] 5.182900e+05 1.400000e+07 5.085000e+05 1.400000e+06 9.750000e+05 5.500000e+06 5.000000e+06 3.450000e+06 5.355000e+05 5.075000e+05 5.085000e+05
[23] 4.125000e+06 2.400000e+07 5.170000e+05 1.100000e+06 1.400000e+07 2.350000e+07 2.270875e+07 3.500000e+06 5.095000e+05 7.400000e+06 5.305000e+05
[34] 5.085000e+05 6.030000e+05 5.095000e+05 5.570000e+05 5.104000e+05 4.000000e+06 5.275000e+05 4.750000e+06 5.150000e+05 9.500000e+06 5.095000e+05
[45] 5.095000e+05 3.666000e+06 2.950000e+06 4.750000e+06 1.600000e+07 1.600000e+07 1.850000e+06 5.138500e+05 1.500000e+06 1.600000e+07 1.250000e+07
[56] 1.000000e+06 1.250000e+07 1.975000e+07 5.070000e+05 9.000000e+05 5.665000e+05 5.085000e+05 1.760000e+07 5.152000e+05 5.085000e+05 5.825000e+06
[67] 2.250000e+06 5.197000e+05 9.000000e+06 5.765000e+05 4.250000e+06 1.300000e+07 7.000000e+05 5.105000e+05 1.250000e+07 3.585000e+05 5.145250e+02
[78] 4.430000e+05 1.250000e+07 4.000000e+06 6.500075e+02 6.000000e+06 5.086000e+05 2.500000e+06
On the whole, there seemed to be nothing wrong with the salary attribute other than the fact that the minimum salary was deemed ‘too small’ (to be an annual salary). Further investigation was necessary, looking at the players with salaries lower than the usual minimum (ESPN.co.uk, 2020).
# Finding the rows with salaries less than the minimum (ESPN.co.uk, 2020).
subset(mySubset,salary<46000)
Using the minimum salary guidance set out, it was found that two players (with player ids: “greeneb02” & “mosslaf01”) had salaries that can be thought of as incorrect. Therefore, these values will need to be dealt with.
Data Quality Issue: salary should not be lower than the usual minimum
Other than this, it was also noticed that the format in which salary is displayed is not the most readable. However, since it was still understandable, and changing the stored format might have led to some loss of data, it was left as it is.
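The display format could be made more readable without touching the stored values, since format() only produces character output. A minimal sketch with hypothetical salary figures:

```r
# Hypothetical salary values in the same scientific-notation style.
salaries <- c(15000000, 508500)
# Render with thousands separators, for display only.
readable <- format(salaries, big.mark = ",", scientific = FALSE)
readable
```

The underlying numeric vector is unchanged, so no precision is lost by formatting the output this way.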
In terms of the data type, ‘double’ is seen as the most suitable datatype for this attribute. Therefore, no further changes need to be made to the datatype of salary.
Each player’s date of birth.
# Visualising summary of the birthdate column.
summary(mySubset$birthDate)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1975-04-03" "1983-07-25" "1986-03-02" "1986-02-02" "1988-12-24" "2002-06-17"
# Visualising the set of values in the birthdate column.
list(mySubset$birthDate)
[[1]]
[1] "1988-08-26" "1990-06-17" "1987-11-01" "1979-04-07" "1992-10-07" "1992-10-01" "1990-04-19" "1980-08-08" "1984-08-14" "1987-07-09" "1991-04-20"
[12] "1984-06-05" "1982-07-13" "1992-01-31" "1987-06-30" "1984-01-07" "1984-07-18" "1984-04-11" "1986-03-06" "1987-01-21" "1988-05-07" "1988-01-08"
[23] "1988-05-02" "1984-05-09" "1987-06-24" "1980-07-21" "1986-02-27" "1983-12-27" "1981-05-21" "1980-08-16" "1989-01-13" "1986-10-09" "1988-06-11"
[34] "1988-01-18" "1988-06-09" "1989-04-30" "1984-11-02" "1989-03-13" "1979-08-02" "1982-02-01" "1988-03-06" "1990-08-05" "1985-03-22" "1990-07-25"
[45] "1990-07-25" "1986-11-13" "1985-09-06" "1984-05-10" "1981-10-31" "1981-10-31" "1983-02-22" "1994-02-03" "1983-10-05" "1975-11-18" "1983-08-17"
[56] "1991-04-04" "1988-12-27" "1983-12-23" "1979-01-18" "1983-05-20" "1989-06-24" "1990-03-11" "1986-08-11" "1987-01-17" "1989-02-09" "1984-10-04"
[67] "1986-06-06" "1988-01-19" "1975-04-03" "1984-10-31" "1982-10-29" "1980-11-30" "1983-04-08" "1984-08-30" "1983-08-17" "2002-06-17" "1992-11-07"
[78] "1992-10-01" "1984-08-17" "1979-08-02" "1983-09-16" "1981-10-18" "1988-12-23" "1981-04-17"
At a high level, the date of birth attribute does not demonstrate any data quality problems. However, seeing as this is data from the 2015 season and players are unlikely to play professionally before they turn 18 (Mathewson, 2019), all players with a date of birth after 1997 (which would make them under 18 in 2015) can be looked at. As there was one player (with player id: “brownmj01”) born after 1997, this was seen as a data quality issue.
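The check described above could be sketched as follows, using a hypothetical stand-in frame; in the notebook the equivalent call would be subset(mySubset, birthDate > as.Date("1997-12-31")):

```r
# Hypothetical stand-in frame: one player born after the 1997 cut-off.
demo <- data.frame(playerID = c("playerA01", "playerB01"),
                   birthDate = as.Date(c("2002-06-17", "1975-11-18")),
                   stringsAsFactors = FALSE)
# Players who would have been under 18 during the 2015 season.
underage <- subset(demo, birthDate > as.Date("1997-12-31"))
underage
```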
Data Quality Issue: players have a date of birth after 1997 – making them underage professional players
In terms of the data type, ‘date’ makes perfect sense for this attribute as it contains the date of birth.
This is a measure of the length of each player’s career (in years).
# Visualising summary of the career.length column.
summary(mySubset$career.length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2902 1.7447 4.4873 4.9172 7.4394 17.3306
# Visualising the set of values in the salary column.
list(mySubset$career.length)
[[1]]
[1] 5.7385352 0.3121150 3.5537303 16.5229295 0.5092402 1.3661875 1.7522245 9.4428474 7.3757700 0.2902122 0.5859001 3.4579055 9.6974675
[14] 0.3860370 3.4524298 5.6563997 4.7337440 7.7508556 7.3182752 2.6310746 2.4914442 0.3805613 5.4127310 9.5523614 2.5845311 1.7522245
[27] 7.5400411 8.6406571 7.7508556 7.3127995 1.3278576 5.6947296 2.3326489 1.7221081 2.5598905 0.4188912 2.3819302 2.6338125 12.7529090
[40] 2.3271732 3.3319644 0.7419576 6.6885695 0.4845996 0.4845996 3.3675565 4.4271047 8.5311431 8.6625599 8.6625599 4.5557837 0.6516085
[53] 4.5475702 17.3305955 8.3613963 2.5133470 5.7303217 9.2813142 9.6098563 6.3956194 2.7323751 0.3422313 6.3819302 2.5681040 0.4873374
[66] 5.3689254 5.4017796 2.5681040 5.7330595 4.2710472 6.3408624 11.7508556 6.6776181 1.6919918 8.3613963 0.3121150 1.5924025 1.3661875
[79] 5.3613963 12.7529090 7.4058864 8.3312799 0.4791239 10.3025325
Upon looking at the length of each player’s career (in years), both as a summary (mean, median, minimum, and maximum) and as individual values, it is apparent that there is no major problem with this attribute.
In terms of the data type, though, ‘dbl’ might not be seen as the best choice for this attribute. Since the career length is measured in years, ‘int’ could be used instead, which would still be numerical but measured at intervals rather than continuously. It would be possible to change the data type of this attribute to ‘int’; however, this was not done, as converting a double to an integer means that some precision would be lost.
The hand used by each player to bat: Left [L], Right [R], and Both [B] (i.e. ambidextrous players).
# Visualising summary of the bats column.
summary(mySubset$bats)
B L R
6 25 53
The Bats attribute contains Left [L], Right [R], or Both [B] and there seem to be no data quality issues with it.
The data type of ‘fctr’ used for this column makes sense since it will only ever contain one of three set values.
This is a measure of the age of each player.
# Visualising summary of the age column.
summary(mySubset$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.91 26.02 28.83 29.14 31.65 39.75
The age values by themselves do not show any data quality issues. However, when comparing the date of birth with the age, there are instances where the two do not match. Seeing as there is not enough information to identify whether the age or the date of birth is in error, it was decided that this would be left as is.
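One way to surface these mismatches is to recompute an approximate age from birthDate and compare it with the recorded value. A minimal sketch with hypothetical stand-in values; the mid-season reference date of 2015-07-01 and the one-year tolerance are assumptions, and in the notebook the frame would be mySubset:

```r
# Hypothetical stand-in frame: the second row's age deliberately
# disagrees with its birth date.
demo <- data.frame(birthDate = as.Date(c("1988-08-26", "1979-04-07")),
                   age = c(26.9, 20.0))
# Approximate age at a mid-2015 reference date (assumption).
derivedAge <- as.numeric(difftime(as.Date("2015-07-01"),
                                  demo$birthDate, units = "days")) / 365.25
# Rows where the recorded age is more than a year off the derived age.
mismatch <- demo[abs(derivedAge - demo$age) > 1, ]
mismatch
```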
While the datatype of age should have ideally been an integer (similar to career length), changing a double to an integer means that some data would be lost. Therefore, it will be left as it is.
This shows if a player made any hits in the given season (2015). This value would be one [1] if the given player made at least one hit and zero [0] if they made none.
# Visualising the 0s and 1s of the hit.ind column.
table(mySubset$hit.ind)
0 1
38 46
An additional column can be created to compare the data in the Hit Ind column with the data in the Hits column.
# Creating an additional column that compares the Hits [H] with the Hit Ind.
# If the Hits [H] column contains zero [0], the Hit Ind column should also be zero [0].
# However, if the Hits [H] column contains a value greater than or equal to one [1], the Hit Ind column should be one [1].
mySubset$CorrectHitInd <- ifelse((mySubset$H == 0) & (mySubset$hit.ind == 0), 'Correct: 0H',
ifelse((mySubset$H >= 1) & (mySubset$hit.ind == 1), 'Correct 1H', 'Incorrect'))
# Visualising correct values (0 hits & 1+ hits) and incorrect values.
table(mySubset$CorrectHitInd)
Correct 1H Correct: 0H
46 38
It is clear that there is no discrepancy between the Hits and the Hit Ind columns in this dataset. The results show that all 84 rows contain accurate data. Therefore, the additional column created to check this can now be deleted.
# Deleting the newly created 'CorrectHitInd' column after assessing its accuracy.
mySubset$CorrectHitInd <- NULL
All these steps proved that the data in the Hit Ind column is in fact accurate. However, the data type of ‘dbl’ used for this attribute does not fit its purpose. The Hit Ind column, while holding the numerical values ‘0’ and ‘1’, is categorical since it contains a binary response (i.e. ‘0’ if the player did not make any hits and ‘1’ if they made at least one). Due to this, the best data type for this column would be ‘fctr’, seeing as it will only contain the categorical values ‘0’ and ‘1’.
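If the conversion to a factor were carried out, as.factor() is one way to do it. A minimal sketch with hypothetical stand-in values; in the notebook the equivalent would be mySubset$hit.ind <- as.factor(mySubset$hit.ind):

```r
# Hypothetical stand-in for the hit.ind column.
hitInd <- c(1, 0, 0, 1)
# Convert the 0/1 numeric vector to a two-level factor.
hitInd <- as.factor(hitInd)
levels(hitInd)
```

After the conversion, summary() and modelling functions treat the attribute as categorical rather than numeric.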
An alternative method of data quality checking would be to use validation rules. This could have been done if rules about the different attributes were known. A few attributes will be used to demonstrate an example of this method (while it is known that these are not real rules, they are used to demonstrate how the function could work): if it was known that the teams were supposed to be ‘BOS’ and/or ‘TEX’, that the games count was supposed to be between 0 and 200, that the runs batted in count was supposed to be between 0 and 120, that the weight was supposed to be between 150 and 350, and that the height was supposed to be between 60 and 80, the following validation rules could be implemented.
# Visualising the data.
head(mySubset)
# Creating a set of validation rules.
validationCheck <- check_that(mySubset,
teamOK = teamID.x == "BOS" | teamID.x == "TEX",
gamesMinOK = G >= 0,
gamesMaxOK = G < 200,
runsBattedInMinOK = RBI >= 0,
runsBattedInMaxOK = RBI < 120,
weightMinOK = weight > 150,
weightMaxOK = weight < 350,
heightMinOK = height > 60,
heightMaxOK = height < 80)
# Without giving each validation rule a name, they would be displayed as V1, V2, V3, and so on, which might make it more difficult to understand the purpose of each rule. Due to this, giving validation rules names is thought to be good practice.
barplot(validationCheck,
main="Validation Rules Check")
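Beyond the bar plot, the outcome of each rule can also be inspected numerically. A sketch of this, assuming the validate package's summary() method for validation objects, which reports the number of passes, fails, and NAs per rule:

```r
# Summarising the validation results numerically (per-rule pass/fail counts).
summary(validationCheck)
```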
It is clear from the results that there are errors: negative values of RBI and extreme values of weight on the upper end (both of which were also found through the previous method of analysis). Therefore, these validation rules could be used for the cleaning stage as well.
Having identified some of the data quality issues, data cleaning can be carried out on this dataset.
Data cleaning, simply put, is the process of detecting and correcting corrupt or inaccurate data in a dataset(s) (Formplus, 2020). Clean data, while often neglected, is extremely important since cleaner data will lead to better decision making. In some cases, ensuring that the data is of a high quality might be a legal obligation further proving the importance of data cleaning. Some of the data quality issues discovered in the previous section could be looked at and cleaned.
When data quality problems (error values and missing values) are found, there are a number of ways of dealing with them: the affected rows can be deleted, or new values can be imputed. In this report, imputation will be used in every possible instance, with the row being deleted where this is not possible.
Creating a new subset to make changes addressing the data quality issues found (variable names, duplicate rows, etc.). The reason a new subset is created with the same data is to ensure that the original data is always available to go back to in case it is required. [In the case of the coursework, the rda file can always be loaded again to get the original data; however, creating a new subset might be seen as beneficial in practice.]
Data Cleaning: add meaningful names to all attributes
# Creating a new subset.
myUpdatedSubset = mySubset %>%
rename(
# Changing column name of attribute 'playerID' in the new subset.
PlayerID = playerID,
# Changing column name of attribute 'teamID.x' in the new subset.
TeamID = teamID.x,
# Changing column name of attribute 'G' in the new subset.
GamesCount = G,
# Changing column name of attribute 'R' in the new subset.
RunsCount = R,
# Changing column name of attribute 'H' in the new subset.
HitsCount = H,
# Changing column name of attribute 'AB' in the new subset.
AtBatCount = AB,
# Changing column name of attribute 'RBI' in the new subset.
RunsBattedInCount = RBI,
# Changing column name of attribute 'weight' in the new subset.
Weight = weight,
# Changing column name of attribute 'height' in the new subset.
Height = height,
# Changing column name of attribute 'salary' in the new subset.
Salary = salary,
# Changing column name of attribute 'birthDate' in the new subset.
Birthdate = birthDate,
# Changing column name of attribute 'career.length' in the new subset.
CareerLength = career.length,
# Changing column name of attribute 'bats' in the new subset.
Bats = bats,
# Changing column name of attribute 'age' in the new subset.
Age = age,
# Changing column name of attribute 'hit.ind' in the new subset.
HitInd = hit.ind
)
# Visualising the newly created subset of data.
View(myUpdatedSubset)
Data Cleaning: remove rows with duplicate player ids – that are in the same team
Deleting rows with duplicate player ids can be carried out using a number of functions. Seeing as some duplicate player ids are valid (those players play for two separate teams), the first method shown could not be used. Instead, a second method of identifying the specific duplicate rows and deleting them is used.
# First Method of Deleting Duplicate Values.
# Deleting rows that have duplicate player IDs.
#myUpdatedSubset<-myUpdatedSubset[!duplicated((myUpdatedSubset$PlayerID)),]
# Second Method of Deleting Duplicate Values.
# Identifying rows which contain duplicate player ids.
which(myUpdatedSubset$PlayerID == "lewisco01")
[1] 39 80
which(myUpdatedSubset$PlayerID == "pedrodu01")
[1] 55 75
# Deleting rows which contain duplicate player ids.
myUpdatedSubset<-myUpdatedSubset[-c(80, 75),]
The ‘duplicated’ function can be run to make sure that the deletion of duplicate rows did in fact work.
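A hedged sketch of that check: since a player may legitimately appear for two different teams, the duplicated() call should consider the PlayerID and TeamID columns together rather than PlayerID alone (column names follow the renamed subset above).

```r
# Checking for rows where the same player appears twice for the same team.
# duplicated() on the two-column subset flags repeated (PlayerID, TeamID) pairs.
any(duplicated(myUpdatedSubset[, c("PlayerID", "TeamID")]))
# A result of FALSE indicates the duplicate rows were removed successfully.
```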
Data Cleaning: correct negative values of ‘runs batted in’
Given that there are two negative runs batted in values, these need to be corrected. One method could be to replace them with the mean or median (so as to not change the mean or median of the attribute); however, a more appropriate method of imputation can be used in this instance. The correlation between the runs batted in attribute and various other attributes can be examined to determine the best way to impute these values.
# Calculating the correlation between RunsBattedInCount & GamesCount.
cor.test(myUpdatedSubset$RunsBattedInCount, myUpdatedSubset$GamesCount)
Pearson's product-moment correlation
data: myUpdatedSubset$RunsBattedInCount and myUpdatedSubset$GamesCount
t = 17.82, df = 80, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8395682 0.9303161
sample estimates:
cor
0.8937424
# Calculating the correlation between RunsBattedInCount & RunsCount.
cor.test(myUpdatedSubset$RunsBattedInCount, myUpdatedSubset$RunsCount)
Pearson's product-moment correlation
data: myUpdatedSubset$RunsBattedInCount and myUpdatedSubset$RunsCount
t = 25.384, df = 80, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9130305 0.9630591
sample estimates:
cor
0.9431645
# Calculating the correlation between RunsBattedInCount & HitsCount.
cor.test(myUpdatedSubset$RunsBattedInCount, myUpdatedSubset$HitsCount)
Pearson's product-moment correlation
data: myUpdatedSubset$RunsBattedInCount and myUpdatedSubset$HitsCount
t = 28.123, df = 80, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9278331 0.9694824
sample estimates:
cor
0.9529643
# Calculating the correlation between RunsBattedInCount & AtBatCount.
cor.test(myUpdatedSubset$RunsBattedInCount, myUpdatedSubset$AtBatCount)
Pearson's product-moment correlation
data: myUpdatedSubset$RunsBattedInCount and myUpdatedSubset$AtBatCount
t = 26.852, df = 80, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9214591 0.9667235
sample estimates:
cor
0.9487509
Since it was found that the ‘hits’ value has the highest correlation with the ‘runs batted in’ value, this will be used for the imputation.
# To use the ‘hotdeck()’ function for imputation, the values that need to be imputed must be NA, so the error values are changed to NA before calling ‘hotdeck()’.
# Finding the values of the rows to be imputed.
which(myUpdatedSubset$PlayerID == "perezjj03")
[1] 81
which(myUpdatedSubset$PlayerID == "grokkel01")
[1] 78
# Changing the rows to be imputed to ‘NA’.
myUpdatedSubset[81,7] <- NA
myUpdatedSubset[78,7] <- NA
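Rather than hard-coding row and column indices (which break if the row or column order changes), the negative values could also be located by condition. A sketch of this alternative:

```r
# Setting every negative RunsBattedInCount value to NA in one step,
# without referencing specific row numbers.
myUpdatedSubset$RunsBattedInCount[myUpdatedSubset$RunsBattedInCount < 0] <- NA
```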
# Imputing the 'RunsBattedInCount' column using the ‘hotdeck()’ function.
myUpdatedSubset <- hotdeck(myUpdatedSubset, variable = "RunsBattedInCount", ord_var = "HitsCount")
# It should be noted that the ‘hotdeck()’ function created an additional column in the dataframe that is set to FALSE if the value has not been imputed and TRUE if it has. This column of data is useful in some instances but is not needed for this problem, so it will be deleted.
# Deleting the 'RunsBattedInCount_imp' column created by the 'hotdeck' function.
myUpdatedSubset$RunsBattedInCount_imp <- NULL
The ‘subset’ function can be used to check if the imputation worked correctly.
# Checking if any rows contain negative RunsBattedInCount values.
subset(myUpdatedSubset,RunsBattedInCount < 0)
It is clear that this imputation worked correctly – with the hot deck function changing the ‘RunsBattedInCount’ of both players to seven.
Another method of imputation would be to use regression (i.e. use the games, runs, hits, and at bat counts to build a model), using the model to determine the appropriate values of runs batted in.
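A minimal sketch of that regression approach, assuming the renamed columns above: fit a linear model on the complete rows, then predict the missing runs batted in values (rounded, since the attribute is a count).

```r
# Fitting a linear model of RunsBattedInCount on HitsCount.
# lm() drops rows with NA values by default, so only complete rows are used.
rbiModel <- lm(RunsBattedInCount ~ HitsCount, data = myUpdatedSubset)
# Predicting values for the rows where RunsBattedInCount is NA.
missingRows <- is.na(myUpdatedSubset$RunsBattedInCount)
myUpdatedSubset$RunsBattedInCount[missingRows] <-
  round(predict(rbiModel, newdata = myUpdatedSubset[missingRows, ]))
```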
In this dataset, there were other attributes that had a high correlation with the runs batted in value – which made it possible to carry out imputation. However, if this was not possible, these rows could have been deleted using the functions below.
# Finding the values of the rows to be deleted.
#which(myUpdatedSubset$PlayerID == "perezjj03")
#which(myUpdatedSubset$PlayerID == "grokkel01")
# Deleting rows.
#myUpdatedSubset<-myUpdatedSubset[-c(81, 78),]
Data Cleaning: correct weight values that exceed 374 lbs – which are deemed ‘too high’
In order to correct the weight value that is deemed ‘too high’, the correlation between the weight and other attributes needs to be found.
# Calculating the correlation between Weight and GamesCount.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$GamesCount)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$GamesCount
t = 0.82242, df = 80, p-value = 0.4133
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1279874 0.3025582
sample estimates:
cor
0.09156285
# Calculating the correlation between Weight and RunsCount.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$RunsCount)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$RunsCount
t = 0.4503, df = 80, p-value = 0.6537
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1685648 0.2644037
sample estimates:
cor
0.05028157
# Calculating the correlation between Weight and HitsCount.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$HitsCount)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$HitsCount
t = 0.69799, df = 80, p-value = 0.4872
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1415963 0.2899138
sample estimates:
cor
0.07780143
# Calculating the correlation between Weight and AtBatCount.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$AtBatCount)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$AtBatCount
t = 0.60958, df = 80, p-value = 0.5439
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1512436 0.2808581
sample estimates:
cor
0.06799519
# Calculating the correlation between Weight and RunsBattedInCount.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$RunsBattedInCount)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$RunsBattedInCount
t = 0.99393, df = 80, p-value = 0.3233
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1091790 0.3197875
sample estimates:
cor
0.1104448
# Calculating the correlation between Weight and Height.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$Height)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$Height
t = 1.0886, df = 80, p-value = 0.2796
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.09877587 0.32919696
sample estimates:
cor
0.1208211
# Calculating the correlation between Weight and Salary.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$Salary)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$Salary
t = 0.6597, df = 80, p-value = 0.5113
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1457774 0.2859986
sample estimates:
cor
0.07355668
# Calculating the correlation between Weight and CareerLength.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$CareerLength)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$CareerLength
t = 1.9829, df = 80, p-value = 0.05081
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.0005988785 0.4139987124
sample estimates:
cor
0.2164364
# Calculating the correlation between Weight and Age.
cor.test(myUpdatedSubset$Weight, myUpdatedSubset$Age)
Pearson's product-moment correlation
data: myUpdatedSubset$Weight and myUpdatedSubset$Age
t = 1.3655, df = 80, p-value = 0.1759
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.0683249 0.3562593
sample estimates:
cor
0.1509199
Seeing as the weight attribute does not significantly correlate with any other attribute, the best approach is to delete this row.
# Finding the value of the row to be deleted.
which(myUpdatedSubset$PlayerID == "rabbibb01")
[1] 82
# Deleting the row from the subset.
myUpdatedSubset<-myUpdatedSubset[-c(82),]
# Checking if any rows of Weight contain values greater than 374 lbs.
subset(myUpdatedSubset,Weight > 374)
The final function showed that the dataset no longer contains weight values classified as ‘too high’.
Data Cleaning: correct low salaries
In order to correct the salary values, the correlation between these and other attributes can be found (similar to before).
# Calculating the correlation between Salary & CareerLength.
cor.test(myUpdatedSubset$Salary, myUpdatedSubset$CareerLength)
Pearson's product-moment correlation
data: myUpdatedSubset$Salary and myUpdatedSubset$CareerLength
t = 7.746, df = 79, p-value = 2.719e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5121270 0.7655358
sample estimates:
cor
0.6570081
# Calculating the correlation between Salary & Age.
cor.test(myUpdatedSubset$Salary, myUpdatedSubset$Age)
Pearson's product-moment correlation
data: myUpdatedSubset$Salary and myUpdatedSubset$Age
t = 4.0213, df = 79, p-value = 0.0001315
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2130307 0.5784888
sample estimates:
cor
0.4122063
Seeing as the salary does not correlate strongly enough with any other attribute to support reliable imputation, deleting these rows is seen as an appropriate method of dealing with this data quality issue.
# Finding the values of the rows to be deleted.
which(myUpdatedSubset$PlayerID == "greeneb02")
[1] 76
which(myUpdatedSubset$PlayerID == "mosslaf01")
[1] 79
# Deleting rows from the subset.
myUpdatedSubset<-myUpdatedSubset[-c(76, 79),]
Data Cleaning: correct players with a date of birth after 1997
As discussed previously, since it is not known whether the ‘date of birth’ or the ‘age’ is the error, the best solution for this data quality issue is to delete the row with a date of birth after 1997.
# Finding the value of the row to be deleted.
which(myUpdatedSubset$PlayerID == "brownmj01")
[1] 75
# Deleting the row from the subset.
myUpdatedSubset<-myUpdatedSubset[-c(75),]
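As with earlier steps, this deletion could also be expressed as a condition on the Birthdate column rather than a hard-coded row index (a sketch, assuming Birthdate is stored as a Date):

```r
# Keeping only players born in 1997 or earlier.
myUpdatedSubset <- subset(myUpdatedSubset, Birthdate < as.Date("1998-01-01"))
```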
Data Cleaning: change datatypes of attributes
As previously identified, a few attributes do not have ideal data types. These include: Weight, Height, CareerLength, Age, and HitInd. Of these, only the data types of Weight, Height, and HitInd can be changed, since changing the data type of CareerLength or Age would mean that some of the information would be lost. The head function can be used to confirm that these changes have indeed been made.
# Visualising the new subset - along with new data types.
head(myUpdatedSubset)
The data types of the three attributes (Weight, Height, and HitInd) can be changed to best reflect the type of data it contains.
# Changing data type of Weight to double.
myUpdatedSubset$Weight <- as.double(myUpdatedSubset$Weight)
# Changing data type of Height to double.
myUpdatedSubset$Height <- as.double(myUpdatedSubset$Height)
# Changing data type of HitInd to factor
myUpdatedSubset$HitInd <- as.factor(myUpdatedSubset$HitInd)
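The conversions can also be confirmed with str(), which prints each column's type alongside a preview of its values (a sketch):

```r
# Displaying the structure of the subset - Weight and Height should now be
# shown as 'num', and HitInd as 'Factor'.
str(myUpdatedSubset)
```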
Having found problems and completed the cleaning of the dataset provided, exploratory data analysis could be carried out
Exploratory data analysis (EDA) is the process of analysing and investigating datasets, summarising their main characteristics and using data visualisation methods to better understand the data (IBM, 2020). With EDA, there are two kinds of summary statistics that can be used to visualise the data: univariate (where the data being analysed consists of one attribute) and multivariate (where, as the name suggests, it consists of more than one attribute).
The general process of EDA, while not well defined, is an iterative cycle of three steps. The first is to generate questions about the data. The second is to try to answer those questions by visualising, transforming, and modelling the data. The final step is to use the findings to refine existing questions and generate new ones, making them more precise and helping to understand the data better. As Sir David Cox stated, “There are no routine statistical questions, only questionable statistical routines”. Coming up with questions that reveal important insights about the data is difficult, since not much is initially known about it. Generally, new questions lead to new discoveries, so every question should be followed up with another in order to drill down into interesting parts of the data. Since there is no fixed set of rules to follow, as a rule of thumb, questions about data types, variation, and covariation are used as a starting point.
It is important to note that, before starting EDA, two things need to be considered: where the data comes from (i.e. whether it is a reputable source) and whether the dataset available is sufficient (if not, more data should be collected). This reduces the chance of problems occurring halfway into the analysis. In this instance, however, this step can be skipped, as the dataset provided is accepted as-is and all analysis and modelling is based on it.
EDA Questions for this dataset includes:
What datatypes are being used? This includes checking whether the data are categorical (which includes binary data in some cases) or numerical (discrete, ordinal, or continuous).
What is the variation between variables? This includes looking at the variation between the variables (Player ID, Team ID, Game Count, Runs Count, Hits Count, At Bat Count, Runs Batted In Count, Weight, Height, Salary, Birth Date, Career Length, Bats, Age, Hit Ind) by themselves.
What is the covariation between variables? This includes looking at the variation between two sets of variables (i.e. runs & weight, weight & height, career length & salary, or career length & age).
[Other questions that could have been used for EDA but are not include: Has a similar analysis been done on this dataset? Is there anything that could be learnt from this?]
It should be noted that questions are a valuable outcome of EDA, therefore, it is still considered to be beneficial to have the outcome of EDA be more questions.
Initially, all of the attributes need to be visualised both numerically and graphically.
# Visualising the summary of all the attributes in the new dataset (as the first step of EDA).
summary(myUpdatedSubset)
PlayerID TeamID GamesCount RunsCount HitsCount AtBatCount RunsBattedInCount Weight Height
Length:78 BOS :39 Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. :170.0 Min. :68.00
Class :character TEX :39 1st Qu.: 18.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:200.0 1st Qu.:71.00
Mode :character ALT : 0 Median : 33.00 Median : 0.50 Median : 1.50 Median : 4.5 Median : 0.00 Median :210.0 Median :73.00
ANA : 0 Mean : 51.72 Mean :16.87 Mean : 34.28 Mean :130.2 Mean : 16.72 Mean :211.4 Mean :73.19
ARI : 0 3rd Qu.: 72.25 3rd Qu.:25.50 3rd Qu.: 52.25 3rd Qu.:200.5 3rd Qu.: 25.00 3rd Qu.:223.8 3rd Qu.:75.00
ATL : 0 Max. :160.00 Max. :94.00 Max. :196.00 Max. :613.0 Max. :108.00 Max. :275.0 Max. :79.00
(Other): 0
Salary Birthdate CareerLength Bats Age HitInd
Min. : 443000 Min. :1975-04-03 Min. : 0.2902 B: 6 Min. :20.91 0:36
1st Qu.: 515050 1st Qu.:1983-08-29 1st Qu.: 1.7522 L:24 1st Qu.:26.10 1:42
Median : 1250000 Median :1986-04-21 Median : 4.3491 R:48 Median :28.70
Mean : 5173863 Mean :1985-12-31 Mean : 4.7733 Mean :29.06
3rd Qu.: 8600000 3rd Qu.:1988-11-23 3rd Qu.: 7.3169 3rd Qu.:31.56
Max. :24000000 Max. :1994-02-03 Max. :17.3306 Max. :39.75
Usually, it is during the process of EDA that the data is loaded and first inspected. The ‘head’ and ‘tail’ functions would be used at this stage to visualise the data (and identify any problems at a surface level). This, however, will not be done in this instance since it was previously done (see Section 1.1).
As part of EDA (exploring and answering the first question), the datatypes in the dataset need to be understood. The ‘class’ function can be used for this. [The reason the ‘typeof’ function is not used in this instance is that the high-level type of the object is what needs to be found. The ‘typeof’ function returns the low-level storage type of the object (e.g. for a factor it returns ‘integer’ rather than ‘factor’, since factors are stored as integer codes).]
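The distinction can be illustrated on a small, self-contained example (not part of the dataset):

```r
# class() reports the high-level type; typeof() the low-level storage type.
f <- factor(c("0", "1", "1"))
class(f)   # "factor"
typeof(f)  # "integer" - factors are stored internally as integer codes
```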
# Determine the data type of 'PlayerID'
class(myUpdatedSubset$PlayerID)
[1] "character"
# Determine the data type of 'TeamID'
class(myUpdatedSubset$TeamID)
[1] "factor"
# Determine the data type of 'GamesCount'
class(myUpdatedSubset$GamesCount)
[1] "integer"
# Determine the data type of 'RunsCount'
class(myUpdatedSubset$RunsCount)
[1] "integer"
# Determine the data type of 'HitsCount'
class(myUpdatedSubset$HitsCount)
[1] "integer"
# Determine the data type of 'AtBatCount'
class(myUpdatedSubset$AtBatCount)
[1] "integer"
# Determine the data type of 'RunsBattedInCount'
class(myUpdatedSubset$RunsBattedInCount)
[1] "integer"
# Determine the data type of 'Weight'
class(myUpdatedSubset$Weight)
[1] "numeric"
# Determine the data type of 'Height'
class(myUpdatedSubset$Height)
[1] "numeric"
# Determine the data type of 'Salary'
class(myUpdatedSubset$Salary)
[1] "numeric"
# Determine the data type of 'Birthdate'
class(myUpdatedSubset$Birthdate)
[1] "Date"
# Determine the data type of 'CareerLength'
class(myUpdatedSubset$CareerLength)
[1] "numeric"
# Determine the data type of 'Bats'
class(myUpdatedSubset$Bats)
[1] "factor"
# Determine the data type of 'Age'
class(myUpdatedSubset$Age)
[1] "numeric"
# Determine the data type of 'HitInd'
class(myUpdatedSubset$HitInd)
[1] "factor"
The datatypes of all the attributes are therefore a mix of character, factor, integer, numeric, and Date values, as shown above.
Upon visualising the data as a whole and understanding the datatypes, each attribute in the dataset can be looked at in detail.
It is understood that some of the functions run in this section (EDA), such as summary, have been run previously as well. They have been run again for completeness of the Exploratory Data Analysis process.
# Visualising the data in 'PlayerID'.
table(myUpdatedSubset$PlayerID)
andruel01 barnema01 bassan01 beltrad01 bettsmo01 bogaexa01 bradlja02 breslcr01 buchhcl01 castiru01 cecchga01 chiriro01 choosh01 claudal01 cogaera01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
cookry01 corpoca01 craigal01 deazaal01 detwiro01 diekmja01 dysonsa01 edwarjo02 felizne01 fieldpr01 freemsa01 fujikky01 gallayo01 grokkel01 hamelco01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
hamiljo03 hanigry01 hembrhe01 hollade01 holtbr01 jimenlu02 kellyjo05 kleinph01 layneto01 leonsa01 lewisco01 machije01 martile01 martini01 masteju01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
mendero01 mileywa01 morelmi01 mujiced01 murphpr04 napolmi01 navada01 odorro01 ogandal01 ortizda01 pedrodu01 perezjj03 perezma02 porceri01 ramirha01
2 1 1 1 1 2 1 1 1 1 1 1 1 1 1
rodriwa01 rosalad01 rossro01 ruary01 sandopa01 schepta01 smolija01 stubbdr01 tazawju01 tollesh01 ueharko01 varvaan01 venabwi01 victosh01 wilsobo02
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
wrighst01
1
The results of visualising the data from ‘PlayerID’ are as expected. Each player has a unique ID, and in some instances a player appears twice since they can play for more than one team in a season.
# Visualising the data in 'TeamID'.
table(myUpdatedSubset$TeamID)
ALT ANA ARI ATL BAL BFN BFP BL1 BL2 BL3 BL4 BLA BLF BLN BLU BOS BR1 BR2 BR3 BR4 BRF BRO BRP BS1 BS2 BSN BSP BSU BUF CAL CH1 CH2 CHA CHF CHN CHP CHU CIN CL1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CL2 CL3 CL4 CL5 CL6 CLE CLP CN1 CN2 CN3 CNU COL DET DTN ELI FLO FW1 HAR HOU HR1 IN1 IN2 IN3 IND KC1 KC2 KCA KCF KCN KCU KEO LAA LAN LS1 LS2 LS3 MIA MID MIL
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MIN ML1 ML2 ML3 ML4 MLA MLU MON NEW NH1 NY1 NY2 NY3 NY4 NYA NYN NYP OAK PH1 PH2 PH3 PH4 PHA PHI PHN PHP PHU PIT PRO PT1 PTF PTP RC1 RC2 RIC SDN SE1 SEA SFN
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SL1 SL2 SL3 SL4 SL5 SLA SLF SLN SLU SPU SR1 SR2 TBA TEX TL1 TL2 TOR TRN TRO WAS WIL WOR WS1 WS2 WS3 WS4 WS5 WS6 WS7 WS8 WS9 WSU
0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Details of the data in the ‘TeamID’ attribute can be visualised graphically as well (i.e. using a bar chart).
# Visualising the teams in the dataset.
ggplot(myUpdatedSubset) + geom_bar(aes(x = TeamID), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Teams") + xlab("Team ID") + ylab("Number of Players")
# Team ID
# BOS: Boston Red Sox
# TEX: Texas Rangers
It is clear that this subset contains an equal number of players from both the Boston Red Sox and Texas Rangers teams.
Game Count
# Visualising a summary of the data in 'GamesCount'.
summary(myUpdatedSubset$GamesCount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 18.00 33.00 51.72 72.25 160.00
# Visualising the ‘Games Count’ by plotting a bar chart of the ‘Total Count of Games Played’ vs the ‘Number of Players’.
ggplot(myUpdatedSubset) + geom_bar(aes(x = GamesCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Games Count") + xlab("Total Count of Games Played") + ylab("Number of Players")
Since the bar chart showed that a large portion of the players have very low ‘GamesCount’ values, another bar chart can be plotted with a restricted x-axis to try to gather additional insights.
# Visualising the same plot of 'GamesCount' with the lowest values excluded.
ggplot(myUpdatedSubset) + geom_bar(aes(x = GamesCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Games Count") + xlab("Total Count of Games Played") + ylab("Number of Players") +
xlim(1, 170) # The upper limit is determined by looking at the maximum value in this attribute.
RunsCount
# Visualising a summary of the data in 'RunsCount'.
summary(myUpdatedSubset$RunsCount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.50 16.87 25.50 94.00
# Visualising the ‘Runs Count’ by plotting a bar chart of the ‘Total Count of Runs’ vs the ‘Number of Players’.
ggplot(myUpdatedSubset) + geom_bar(aes(x = RunsCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Runs Count") + xlab("Total Count of Runs") + ylab("Number of Players")
Since the bar chart showed that a large portion of the players have a ‘RunsCount’ of zero, another bar chart can be plotted, removing the zero values, to try to gather additional insights.
# Visualising the same plot of 'RunsCount' with the zero values excluded.
ggplot(myUpdatedSubset) + geom_bar(aes(x = RunsCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Runs Count") + xlab("Total Count of Runs") + ylab("Number of Players") +
xlim(1, 100) # The upper limit is determined by looking at the maximum value in this attribute.
Hits Count
# Visualising a summary of the data in 'HitsCount'
summary(myUpdatedSubset$HitsCount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 1.50 34.28 52.25 196.00
# Visualising the ‘Hits Count’ by plotting a bar chart of the ‘Total Count of Hits’ vs the ‘Number of Players’.
ggplot(myUpdatedSubset) + geom_bar(aes(x = HitsCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Hits Count") + xlab("Total Count of Hits") + ylab("Number of Players")
Since the bar chart showed that a large portion of the players have a ‘HitsCount’ of zero, another bar chart can be plotted, removing the zero values, to try to gather additional insights.
# Visualising the same plot of 'HitsCount' with the zero values excluded.
ggplot(myUpdatedSubset) + geom_bar(aes(x = HitsCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Hits Count") + xlab("Total Count of Hits") + ylab("Number of Players") +
xlim(1, 200) # The upper limit is determined by looking at the maximum value in this attribute.
At Bat Count
# Visualising a summary of the data in 'AtBatCount'
summary(myUpdatedSubset$AtBatCount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 4.5 130.2 200.5 613.0
A function that enables only integer values to be displayed on the y axis of plots is used throughout this section.
# Visualising the ‘At Bat’ by plotting a bar chart of the ‘Total Count of At Bat’ vs the ‘Number of Players’.
ggplot(myUpdatedSubset) + geom_bar(aes(x = AtBatCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of At Bat") + xlab("Total Count of At Bat") + ylab("Number of Players")
Since the bar chart showed that a large portion of the players have an ‘AtBatCount’ of zero, another bar chart can be plotted, removing the zero values, to try to gather additional insights.
# Visualising the same plot of 'AtBatCount' with the zero values excluded.
ggplot(myUpdatedSubset) + geom_bar(aes(x = AtBatCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of At Bat") + xlab("Total Count of At Bat") + ylab("Number of Players") +
xlim(1, 620) # The upper limit is determined by looking at the maximum value in this attribute.
Runs Batted In Count
# Visualising a summary of the data in 'RunsBattedInCount'
summary(myUpdatedSubset$RunsBattedInCount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 16.72 25.00 108.00
# Visualising the ‘Runs Batted In’ by plotting a bar chart of the ‘Total Count of Runs Batted In’ vs the ‘Number of Players’.
ggplot(myUpdatedSubset) + geom_bar(aes(x = RunsBattedInCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Runs Batted In") + xlab("Total Count of Runs Batted In") + ylab("Number of Players")
Since the bar chart showed that a large portion of the players have a ‘RunsBattedInCount’ of zero, another bar chart can be plotted, removing the zero values, to try to gather additional insights.
# Visualising the same plot of 'RunsBattedInCount' with the zero values excluded.
ggplot(myUpdatedSubset) + geom_bar(aes(x = RunsBattedInCount), fill = 'purple') + theme_bw() + ggtitle("Bar Chart of Runs Batted In") + xlab("Total Count of Runs Batted In") + ylab("Number of Players") +
xlim(1, 110) # The upper limit is determined by looking at the maximum value in this attribute.
The visualisation of the Game Count, Runs Count, Hits Count, At Bat Count, and Runs Batted In Count showed that a majority of the data in these attributes were zero values. When these zero values were removed and the plots visualised again, it was found that the scores were spread out with a higher number of values being on the lower end of the spectrum.
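Since the five plots above differ only in the attribute and labels used, the repetition could be reduced with a small helper function. This is an illustrative sketch (the function name and arguments are hypothetical choices, not part of the original analysis):

```r
# A helper that plots a bar chart of a count attribute vs the number of players.
# 'column' is a string naming the attribute; 'label' is used for the title and x-axis.
plotCountBar <- function(data, column, label) {
  ggplot(data) +
    geom_bar(aes(x = .data[[column]]), fill = "purple") +
    theme_bw() +
    ggtitle(paste("Bar Chart of", label)) +
    xlab(paste("Total Count of", label)) +
    ylab("Number of Players")
}
# Example usage: plotCountBar(myUpdatedSubset, "RunsCount", "Runs")
```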
Weight
# Visualising a summary of the data in 'Weight'.
summary(myUpdatedSubset$Weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
170.0 200.0 210.0 211.4 223.8 275.0
# Visualising the ‘Weight’ attribute by plotting a histogram of the number of players with different weights.
ggplot(myUpdatedSubset, aes(x=Weight)) + geom_histogram(color="purple", fill="purple", binwidth=2) + theme_bw() + ggtitle("Histogram of Weight") + xlab("Weight of Players") + ylab("Number of Players")
# Visualising the ‘Weight’ attribute - along with the density function to further understand its distribution.
ggplot(myUpdatedSubset, aes(x=Weight)) + geom_histogram(aes(y=..density..), color="purple", fill="purple", binwidth=2) + geom_density(alpha=.2, fill="black") + theme_bw() + ggtitle("Histogram of Weight") + xlab("Weight of Players") + ylab("Number of Players (normalised)")
Height
# Visualising a summary of the data in 'Height'.
summary(myUpdatedSubset$Height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
68.00 71.00 73.00 73.19 75.00 79.00
# Visualising the ‘Height’ attribute by plotting a histogram of the number of players with different height.
ggplot(myUpdatedSubset, aes(x=Height)) + geom_histogram(color="purple", fill="purple", binwidth=0.5) + theme_bw() + ggtitle("Histogram of Height") + xlab("Height of Players") + ylab("Number of Players")
# Visualising the ‘Height’ attribute - along with the density function to further understand its distribution.
ggplot(myUpdatedSubset, aes(x=Height)) + geom_histogram(aes(y=..density..), color="purple", fill="purple", binwidth=0.5) + geom_density(alpha=.2, fill="black") + theme_bw() + ggtitle("Histogram of Height") + xlab("Height of Players") + ylab("Number of Players (normalised)")
The weight and height of the players were summarised numerically, including the mean, median, range, and interquartile range. Afterwards, the weight and height were visualised graphically using histograms, and it was found that both are (for the most part) normally distributed. It is also understood that there might not be enough data to form a ‘perfect’ normal distribution. Density lines were then added to the plots (which are essentially smoothed versions of the histograms). On the whole, there isn’t anything notable about the weight or height.
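The visual impression of approximate normality can also be checked more formally. A hedged sketch, assuming ‘myUpdatedSubset’ is loaded; note that the Shapiro-Wilk test’s null hypothesis is normality, so a large p-value is merely consistent with a normal distribution rather than proof of it:

```r
# Shapiro-Wilk normality tests for the two attributes.
shapiro.test(myUpdatedSubset$Weight)
shapiro.test(myUpdatedSubset$Height)

# Q-Q plots as a visual check: points lying close to the reference
# line suggest approximate normality.
qqnorm(myUpdatedSubset$Weight); qqline(myUpdatedSubset$Weight)
qqnorm(myUpdatedSubset$Height); qqline(myUpdatedSubset$Height)
```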
# Visualising a summary of the data in 'Salary'.
summary(myUpdatedSubset$Salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
443000 515050 1250000 5173863 8600000 24000000
# Visualising the ‘Salary’ attribute by plotting a bar chart.
ggplot(myUpdatedSubset, aes(x=Salary)) + geom_bar(color="purple", fill="purple") + theme_bw() + ggtitle("Bar Chart of Players' Salary") + xlab("Salary of Players") + ylab("Number of Players (with Salary)")
Upon visualising the salary attribute, it was found that quite a few players had a salary on the lower end of the spectrum, while only a few players had an ‘extremely high’ salary; the distribution is clearly uneven. Since this dataset contains data from the real world, where salary distributions are known to be skewed (Thewissen, et al., 2015), this distribution of salary is understandable.
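Heavily right-skewed attributes such as salary are often easier to read on a logarithmic axis. A minimal sketch, assuming ‘myUpdatedSubset’ is loaded:

```r
library(ggplot2)

# Re-plotting the salary with a log10-scaled x axis, which spreads out
# the cluster of low salaries and pulls in the extreme values.
ggplot(myUpdatedSubset, aes(x = Salary)) +
  geom_histogram(color = "purple", fill = "purple", bins = 30) +
  scale_x_log10() +
  theme_bw() +
  ggtitle("Histogram of Players' Salary (log scale)") +
  xlab("Salary of Players (log10 scale)") +
  ylab("Number of Players")
```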
Birth Date
# Visualising a summary of the data in 'BirthDate'.
# Note: the NULL result below suggests the column is actually named 'Birthdate' (as used elsewhere in this report), so summary(myUpdatedSubset$Birthdate) would be needed to summarise the actual dates.
summary(myUpdatedSubset$BirthDate)
Length Class Mode
0 NULL NULL
Career Length
# Visualising a summary of the data in 'CareerLength'.
summary(myUpdatedSubset$CareerLength)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2902 1.7522 4.3491 4.7733 7.3169 17.3306
# Visualising the ‘CareerLength’ attribute by plotting a bar chart.
ggplot(myUpdatedSubset, aes(x=CareerLength)) + geom_bar(color="purple", fill="purple") + theme_bw() + ggtitle("Bar Chart of CareerLength") + xlab("Career Length of Players") + ylab("Number of Players")
Age
# Visualising a summary of the data in 'Age'.
summary(myUpdatedSubset$Age)
# Visualising the ‘Age’ attribute by plotting a bar chart.
ggplot(myUpdatedSubset, aes(x=Age)) + geom_bar(color="purple", fill="purple") + theme_bw() + ggtitle("Bar Chart of Players' Age") + xlab("Age of Players") + ylab("Number of Players")
Seeing as the date of birth is different for each player, further analysis cannot be carried out on this attribute alone. There is a range of different ages and career lengths among the players. It should be noted that since the age and career length of players are represented as double values, almost none of the players share the same age or career length, which would not be the case if these numbers were integers. Not much can be said about these attributes when visualising them individually; therefore, further analysis will look at them alongside the other attributes.
# Visualising a summary of the data in 'Bats'.
summary(myUpdatedSubset$Bats)
B L R
6 24 48
# Visualising the ‘Bats’ attribute by plotting a bar chart.
ggplot(myUpdatedSubset, aes(x=Bats)) + geom_bar(color="purple", fill="purple") + theme_bw() + ggtitle("Bar Chart of Batting Hand") + xlab("Batting Hand of Players") + ylab("Number of Players")
# Batting Hand
# B: Both (i.e. ambidextrous players)
# L: Left
# R: Right
Visualising the data in the ‘Bats’ attribute showed that a majority of the players used their right hand to play, with only half as many players using their left hand, and only a very small portion of players using both. The ratio of this can be written as follows: 1:4:8 [Both: Left: Right].
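The 1:4:8 ratio quoted above can be computed directly from the counts rather than by eye; a small sketch using the counts from the summary of ‘Bats’:

```r
# Counts of each batting hand, taken from summary(myUpdatedSubset$Bats).
bats_counts <- c(B = 6, L = 24, R = 48)

# Dividing by the smallest count expresses the counts as a ratio.
bats_counts / min(bats_counts)
# B L R
# 1 4 8
```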
# Visualising a summary of the data in 'HitInd'.
table(myUpdatedSubset$HitInd)
0 1
36 42
# Visualising the ‘HitInd’ attribute by plotting a bar chart.
ggplot(myUpdatedSubset, aes(x=HitInd)) + geom_bar(color="purple", fill="purple") + theme_bw() + ggtitle("Bar Chart of Hit Ind") + xlab("Hit Ind") + ylab("Number of Players")
Upon visualising the data in the ‘HitInd’ column, it can be said that a roughly similar number of players did and did not score a hit in the 2015 season. This attribute by itself might not provide many insights; however, the correlation between ‘HitInd’ and the other variables might be worth looking at.
Having visualised each attribute separately, multiple attributes and the relationships between them can now be looked at.
# Using an 'int_breaks' function to ensure that only integer values are displayed on the y axis of plots (where it is not possible to have double values).
# [source: https://stackoverflow.com/questions/15622001/how-to-display-only-integer-values-on-an-axis-using-ggplot2]
int_breaks <- function(x, n = 100) {
  l <- pretty(x, n)
  l[abs(l %% 1) < .Machine$double.eps ^ 0.5]
}
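A quick standalone check of the helper (repeating its definition so the snippet runs on its own): pretty() proposes many candidate break points, and the helper keeps only those that are whole numbers.

```r
# Helper repeated from above for a self-contained demonstration.
int_breaks <- function(x, n = 100) {
  l <- pretty(x, n)
  l[abs(l %% 1) < .Machine$double.eps ^ 0.5]
}

# Over the range 0 to 5, only the whole-number break points survive.
int_breaks(c(0, 5))
```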
It should be noted that the purpose of EDA is not to ‘thoughtlessly’ plot every attribute against every other attribute, but rather to understand more about the data in order to carry out further analysis and modelling. In this case, the research questions involve ‘Salary’ and ‘HitInd’. It was found during data cleaning that ‘HitsCount’ is highly correlated with a few other attributes, so this could be explored. Further, since it was found that career length has an effect on the salary (and the salary was going to be used for modelling), it was decided that this would be explored as well. Since there are two teams in the dataset, understanding any significant differences between the teams was also thought to be important. In summary, the attributes explored further are as follows:
Since data about players from two teams (Boston Red Sox & Texas Rangers) were available, the variation of all the other attributes compared to the players of the two teams can be looked at. This will show if players from one team have attributes that are significantly different to the other team or if both teams have attributes that are similar. To do this, box plots can be used, where all of the different attributes are plotted (separately) against the Team ID.
# Visualising the Variation of 'GamesCount' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = GamesCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'GamesCount' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
There were 50 or more warnings (use warnings() to see the first 50)
# Visualising the Variation of 'RunsCount' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = RunsCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'RunsCount' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'HitsCount' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = HitsCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'HitsCount' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'AtBatCount' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = AtBatCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'AtBatCount' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'RunsBattedInCount' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = RunsBattedInCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'RunsBattedInCount' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'Weight' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = Weight)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Weight' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'Height' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = Height)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Height' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'Salary' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = Salary)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Salary' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'CareerLength' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = CareerLength)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'CareerLength' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'CareerLength' between the two teams using a Frequency Polygon.
ggplot(data = myUpdatedSubset, mapping = aes(x = CareerLength, colour = TeamID)) + geom_freqpoly(binwidth = 0.1) + theme_bw() + ggtitle("Variation of 'CareerLength' between the two teams") + xlab("Career Length of Players") + ylab("Players") + scale_y_continuous(breaks = int_breaks)
# Visualising the Variation of 'Age' between the two teams using a box plot.
ggplot(data = myUpdatedSubset, aes(x = TeamID, y = Age)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Age' between the two teams") + xlab("Team ID") + ylab("Players") + theme(legend.position="none")
# Visualising the Variation of 'Age' between the two teams using a Frequency Polygon.
ggplot(data = myUpdatedSubset, mapping = aes(x = Age, colour = TeamID)) + geom_freqpoly(binwidth = 0.1) + theme_bw() + ggtitle("Variation of 'Age' between the two teams") + xlab("Age of Players") + ylab("Players") + scale_y_continuous(breaks = int_breaks)
Upon visualising the variation between the teams using box plots, the following can be said. ‘Games’, ‘Runs’, ‘Hits’, ‘AtBat’, and ‘RunsBattedIn’ all had similar counts between the two teams, with BOS showing slightly higher (almost negligible) variation for most of these attributes; the median values for TEX were also slightly higher, but not meaningfully so. In terms of weight and height, there is a clear difference, with the TEX team being heavier and taller overall. The spread for both teams is about the same, so it can be understood that players in the TEX team generally weigh more and are taller than players in the BOS team. While the median salaries are similar, there is a much higher variation in the salaries of players in the BOS team, which might be insightful for the modelling process. The career lengths of players in both teams were almost the same, so a frequency polygon was used to explore this further, which confirmed the finding. While there was a difference in the median ages of the players in the two teams, the variation was very similar.
On the whole, only the variation in salary is notable – which might be useful during the modelling process.
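The salary difference between the teams can also be checked formally. A hedged sketch, assuming ‘myUpdatedSubset’ is loaded with its two ‘TeamID’ levels (BOS and TEX); Welch’s t-test is used since it does not assume equal variances, which suits the unequal spread seen in the box plots:

```r
# Welch two-sample t-test of salary between the two teams; a small
# p-value would suggest the average salaries genuinely differ.
t.test(Salary ~ TeamID, data = myUpdatedSubset)
```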
The hits attribute can be visualised using a scatter plot, comparing it to all other attributes.
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'GamesCount'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = GamesCount)) + geom_point(shape=4, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'GamesCount'") + xlab("Count of Hits") + ylab("Count of Games")
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'RunsCount'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = RunsCount)) + geom_point(shape=4, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'RunsCount'") + xlab("Count of Hits") + ylab("Count of Runs")
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'AtBatCount'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = AtBatCount)) + geom_point(shape=4, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'AtBatCount'") + xlab("Count of Hits") + ylab("Count of At Bat")
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'RunsBattedInCount'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = RunsBattedInCount)) + geom_point(shape=4, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'RunsBattedInCount'") + xlab("Count of Hits") + ylab("Count of Runs Batted In")
As expected, the plots showed that hits are highly correlated with games, runs, at bats, and runs batted in, with all of them having an almost linear relationship. It was also found that a number of players had all of these counts at zero, which is understandable since a large portion of the players have hits, games, runs, at bat, and runs batted in counts of zero.
Next, all of the other attributes can be visualised against the hits.
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'Weight'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = Weight)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'Weight'") + xlab("Count of Hits") + ylab("Weight")
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'Height'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = Height)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'Height'") + xlab("Count of Hits") + ylab("Height")
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'Salary'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = Salary)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'Salary'") + xlab("Count of Hits") + ylab("Salary")
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'CareerLength'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = CareerLength)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'CareerLength'") + xlab("Count of Hits") + ylab("Career Length")
# Visualising the attributes by using a scatter plot of 'HitsCount' vs 'Age'.
ggplot(data = myUpdatedSubset, aes(x = HitsCount, y = Age)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'HitsCount' vs 'Age'") + xlab("Count of Hits") + ylab("Age")
There was nothing significant that was found through visualising any of these scatter plots. Most of the points were distributed randomly throughout, with there being almost no correlation between the hits count and any of these variables.
Seeing as a model was going to be created using the salary of players, it was decided that the salary would be explored against the other attributes. Since salary had already been compared against some attributes earlier, only the attributes not yet explored were used at this stage.
# Visualising the attributes by using a scatter plot of 'Salary' vs 'Weight'.
ggplot(data = myUpdatedSubset, aes(x = Salary, y = Weight)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Salary' vs 'Weight'") + xlab("Salary") + ylab("Weight")
# Visualising the attributes by using a scatter plot of 'Salary' vs 'Height'.
ggplot(data = myUpdatedSubset, aes(x = Salary, y = Height)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Salary' vs 'Height'") + xlab("Salary") + ylab("Height")
# Visualising the attributes by using a scatter plot of 'Salary' vs 'CareerLength'.
ggplot(data = myUpdatedSubset, aes(x = Salary, y = CareerLength)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Salary' vs 'CareerLength'") + xlab("Salary") + ylab("Career Length")
# Visualising the attributes by using a scatter plot of 'Salary' vs 'Age'.
ggplot(data = myUpdatedSubset, aes(x = Salary, y = Age)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Salary' vs 'Age'") + xlab("Salary") + ylab("Age")
These plots showed that the weight, height, and age have almost no effect on the salary, with the points being randomly spread out. Although not a one-to-one relationship, it was clear that career length and salary are correlated (i.e. an increase in career length led to an increase in salary). This needs to be kept in mind when building the salary model, as career length should definitely be among the attributes included in the model.
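The apparent association between salary and career length can also be quantified rather than judged by eye. A minimal sketch, assuming ‘myUpdatedSubset’ is loaded:

```r
# Pearson correlation between salary and career length, with a
# significance test and a confidence interval for the coefficient.
cor.test(myUpdatedSubset$Salary, myUpdatedSubset$CareerLength)
```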
Since career length and salary are correlated (which was further confirmed by the plots), it was decided that career length would be explored further to understand which other attributes affect it.
# Visualising the attributes by using a scatter plot of 'CareerLength' vs 'Weight'.
ggplot(data = myUpdatedSubset, aes(x = CareerLength, y = Weight)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Career Length' vs 'Weight'") + xlab("Career Length") + ylab("Weight")
# Visualising the attributes by using a scatter plot of 'CareerLength' vs 'Height'.
ggplot(data = myUpdatedSubset, aes(x = CareerLength, y = Height)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Career Length' vs 'Height'") + xlab("Career Length") + ylab("Height")
# Visualising the attributes by using a scatter plot of 'CareerLength' vs 'Birthdate'.
ggplot(data = myUpdatedSubset, aes(x = CareerLength, y = Birthdate)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Career Length' vs 'Birthdate'") + xlab("Career Length") + ylab("Birthdate")
# Visualising the attributes by using a scatter plot of 'CareerLength' vs 'Age'.
ggplot(data = myUpdatedSubset, aes(x = CareerLength, y = Age)) + geom_point(shape=1, color = "black") + theme_bw() + ggtitle("Scatterplot of 'Career Length' vs 'Age'") + xlab("Career Length") + ylab("Age")
The scatter plots showed that the career length of players was not affected by their weight or height; both of those plots had points distributed randomly throughout, with nothing significant or notable. The plot of career length vs date of birth showed an inverse relationship between the two attributes (i.e. the further the date of birth from the current day [the 2015 season], the longer the career length). This makes sense, since players born earlier are older and are therefore likely to have had longer careers. The plot of age vs career length showed a similar relationship, with career length increasing along with age. This is understandable as well, since older players are likely to have a longer career length.
Since the second research question is based around the ‘HitInd’ (i.e. building a logistic regression model using ‘HitInd’), it was decided that this attribute would need to be explored further.
# Visualising the Variation of 'GamesCount' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = GamesCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'GamesCount' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Games Count") + theme(legend.position="none")
# Visualising the Variation of 'RunsCount' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = RunsCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'RunsCount' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Runs Count") + theme(legend.position="none")
# Visualising the Variation of 'HitsCount' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = HitsCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'HitsCount' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Hits Count") + theme(legend.position="none")
# Visualising the Variation of 'AtBatCount' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = AtBatCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'AtBatCount' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("At Bat Count") + theme(legend.position="none")
# Visualising the Variation of 'RunsBattedInCount' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = RunsBattedInCount)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'RunsBattedInCount' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Runs Batted In Count") + theme(legend.position="none")
# Visualising the Variation of 'Weight' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = Weight)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Weight' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Weight") + theme(legend.position="none")
# Visualising the Variation of 'Height' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = Height)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Height' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Height") + theme(legend.position="none")
# Visualising the Variation of 'Salary' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = Salary)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Salary' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Salary") + theme(legend.position="none")
# Visualising the Variation of 'CareerLength' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = CareerLength)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'CareerLength' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Career Length") + theme(legend.position="none")
# Visualising the Variation of 'Age' compared to the 'HitInd' using a box plot.
ggplot(data = myUpdatedSubset, aes(x = HitInd, y = Age)) + geom_boxplot(color="blue", fill="purple", alpha=0.2) + theme_bw() + ggtitle("Variation of 'Age' compared to the 'HitInd'") + xlab("Hit Ind") + ylab("Age") + theme(legend.position="none")
The plot of ‘GamesCount’ showed that the number of games played was, on average, higher for players that scored a hit. This makes sense, since players that scored at least one hit in the season were more likely to have been in more games. The plots of ‘RunsCount’, ‘HitsCount’, ‘AtBatCount’, and ‘RunsBattedInCount’ showed that any player who did not score a hit in the season had zero values for all of these attributes, while players that scored at least one hit in the 2015 season had a range of values. This makes sense as well, since a player needs to score a hit before their runs, hits, at bat, or runs batted in counts can increase. The plot of weight showed that there was not a big difference in variation between players that did and did not score a hit; therefore, weight might not have a big impact on the ability to score a hit (and might not need to be included in the model). The plot of height suggested that shorter players might have a higher chance of scoring a hit in a given season, which might need to be considered during the modelling stage. The plot of salary suggests that a player’s salary is clearly associated with the ability to make a hit in the season; this makes sense, since ‘good’ players that are able to make a hit are likely to be paid more. The plot of career length also shows that players with a longer career length have scored a hit, so an increased career length might have an impact on the ability to score a hit. This makes sense when thought about logically, since players with more experience would be more likely to score a hit. While not significant, the plot of age also showed that older players are more likely to score a hit than younger players.
Looking at all of these plots, it can be said that the count of games, runs, hits, at bat, runs batted in, height, salary, career length, and age might have an effect on a player’s ability to score a hit. These insights might be useful when building a model of ‘HitInd’.
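A sketch of the kind of logistic regression model these insights point towards; the explanatory variables here are chosen purely for illustration, and it is assumed that ‘myUpdatedSubset’ is loaded with ‘HitInd’ coded as 0/1 (or a two-level factor):

```r
# Logistic regression of HitInd on a few of the candidate predictors
# identified above; family = binomial gives the logit link.
hitModel <- glm(HitInd ~ Salary + CareerLength + Height,
                family = binomial, data = myUpdatedSubset)
summary(hitModel)
```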
Additional insights and issues relating to Exploratory Data Analysis were addressed in the previous section itself (Section 2.2), with further analysis being carried out whenever it was needed.
Before carrying out any analysis or modelling, it should be noted that statistical analysis is an iterative process in which all of the different stages need to be considered end to end: defining/understanding the problem at hand, planning the method to solve it, finding and/or collecting the data needed, carrying out analysis on the data, and drawing conclusions from that analysis.
When it comes to building a model given the research question (i.e. target attribute of salary), the best approach would be to use multiple regression since there is more than one attribute (GamesCount, RunsCount, HitsCount, AtBatCount, RunsBattedInCount, Weight, Height, and so on) that might affect the dependent variable (Salary). Linear regression is the process of modelling two numerical variables (one explanatory variable and one dependent variable) to be able to understand their relationship and use this for further analysis. Multiple regression is an extension of simple linear regression (Tranmer & Elliot, 2008) where multiple variables are modelled (multiple explanatory variables and one dependent variable).
A typical model of this kind could be represented as follows:
\[Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} +\ ...\ + \beta_{k}X_{k} + \varepsilon\]
Where \(\varepsilon\) is the error term, E(\(\varepsilon\))=0 and Var(\(\varepsilon\))=\(\sigma^{2}\)
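A minimal self-contained illustration of fitting such a model in R (the variables and data here are entirely hypothetical): the formula Y ~ X1 + X2 mirrors the equation above, and lm() estimates the beta coefficients by least squares.

```r
set.seed(1)

# Simulated data with known coefficients: beta0 = 2, beta1 = 3,
# beta2 = -1.5, plus a normal error term.
X1 <- rnorm(50)
X2 <- rnorm(50)
Y  <- 2 + 3 * X1 - 1.5 * X2 + rnorm(50)

# Fitting the multiple regression model and inspecting the estimates,
# which should be close to the true values used above.
fit <- lm(Y ~ X1 + X2)
coef(fit)
```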
It should be noted that all of the variables/attributes (both the dependent variable and all of the explanatory variables) need to be numerical when building a model of this sort.
Seeing as all of the attributes to be used in the model need to be numerical, a new subset can be created that contains only numerical variables. This can be done in a number of different ways, two of which are shown below:
# Creating new modelling subset.
myModellingSubset <- myUpdatedSubset
# Removing non-numeric attributes.
myModellingSubset$PlayerID <- NULL
myModellingSubset$TeamID <- NULL
myModellingSubset$Birthdate <- NULL
myModellingSubset$Bats <- NULL
# Changing data type of HitInd - to be used in the model.
myModellingSubset$HitInd <- as.double(myModellingSubset$HitInd)
# An alternative method of creating a subset with only numeric attributes.
#myModellingSubset <- myUpdatedSubset[,sapply(myUpdatedSubset, is.numeric)]
# Visualising the new subset.
head(myModellingSubset)
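One caution about the as.double() conversion above: if ‘HitInd’ is stored as a factor, as.double() returns the internal level codes (1 and 2) rather than the original 0/1 labels, which would explain a 1-to-2 range in later summaries. A small self-contained illustration, with a conversion that preserves the labels:

```r
# A factor's numeric conversion yields its level codes, not its labels.
x <- factor(c(0, 1, 1, 0))
as.double(x)                  # 1 2 2 1 (level codes)

# Converting via the character labels recovers the original 0/1 values.
as.numeric(as.character(x))   # 0 1 1 0
```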
Before beginning the process of modelling, it should be noted that multiple regression is often considered one of the more demanding statistical methods, with the process of carrying it out correctly widely regarded as ‘challenging’. It is also important to consider several statistical issues (Crawley, 2015) relating to multiple regression before beginning the process of modelling.
The data cleaning and EDA previously carried out gave insights into the different attributes of the dataset. This also showed the correlations between some of the variables, making it possible to only include one highly correlated variable in the model. EDA carried out also gave further insights into the target attribute (Salary) when compared to all of the other attributes.
One of those issues that is going to affect the modelling of this dataset is the one about (numerical) explanatory variables being highly correlated to each other. Through the process of data cleaning and EDA, it was found that the ‘GamesCount’, ‘RunsCount’, ‘HitsCount’, ‘AtBatCount’, and ‘RunsBattedInCount’ are all highly correlated to each other. Therefore, it was understood that only one of these variables could be used in the model.
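One way to act on this finding is to keep a single representative of the highly correlated count attributes and drop the rest before fitting; ‘HitsCount’ is kept here purely for illustration. A hedged sketch, assuming ‘myModellingSubset’ exists:

```r
# Dropping all but one of the highly correlated count attributes,
# keeping HitsCount as the representative.
reducedSubset <- subset(myModellingSubset,
                        select = -c(GamesCount, RunsCount,
                                    AtBatCount, RunsBattedInCount))
names(reducedSubset)
```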
While it was also found that the salaries of the two teams might differ on average, the team ID was not used in this model (since the model contains only numerical variables).
Multiple regression is where the notion that “all models are wrong, but some are useful” applies, with it being the job of the statistician/data scientist to find a useful model for a given dataset.
There are a number of ways of starting the process of multiple regression and a number of ways to categorise models (null model, minimal adequate model, current model, and so on), with the minimal adequate model being the ideal outcome of this modelling process. The minimal adequate model is one that has a good enough R-squared value (more details of which can be found below) but does not contain anything ‘unnecessary’.
When it comes to building a multiple regression model for a given research question, a number of steps need to be followed. The first is usually to load the data correctly, which has already been done in this instance. The next is to visualise the dataset numerically and graphically; the ‘summary’ function will be used for the former. The data was already visualised graphically in the previous section (with the findings reported), so further graphical visualisation does not need to be carried out.
# Visualising a summary of the dataset.
summary(myModellingSubset)
GamesCount RunsCount HitsCount AtBatCount RunsBattedInCount Weight Height Salary CareerLength
Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. :170.0 Min. :68.00 Min. : 443000 Min. : 0.2902
1st Qu.: 18.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:200.0 1st Qu.:71.00 1st Qu.: 515050 1st Qu.: 1.7522
Median : 33.00 Median : 0.50 Median : 1.50 Median : 4.5 Median : 0.00 Median :210.0 Median :73.00 Median : 1250000 Median : 4.3491
Mean : 51.72 Mean :16.87 Mean : 34.28 Mean :130.2 Mean : 16.72 Mean :211.4 Mean :73.19 Mean : 5173863 Mean : 4.7733
3rd Qu.: 72.25 3rd Qu.:25.50 3rd Qu.: 52.25 3rd Qu.:200.5 3rd Qu.: 25.00 3rd Qu.:223.8 3rd Qu.:75.00 3rd Qu.: 8600000 3rd Qu.: 7.3169
Max. :160.00 Max. :94.00 Max. :196.00 Max. :613.0 Max. :108.00 Max. :275.0 Max. :79.00 Max. :24000000 Max. :17.3306
Age HitInd
Min. :20.91 Min. :1.000
1st Qu.:26.10 1st Qu.:1.000
Median :28.70 Median :2.000
Mean :29.06 Mean :1.538
3rd Qu.:31.56 3rd Qu.:2.000
Max. :39.75 Max. :2.000
Since this subset – created specifically for modelling – contains only numerical continuous values, the correlation between the attributes can be found. It is known that the correlation between some of the attributes was found during the data cleaning and EDA stages; however, looking at the correlation between all of the attributes as a whole is still important.
# Visualising the correlation between the attributes.
cor(myModellingSubset)
GamesCount RunsCount HitsCount AtBatCount RunsBattedInCount Weight Height Salary CareerLength Age
GamesCount 1.000000000 0.91735674 0.928854838 0.93399912 0.90615888 0.049338900 -0.38377370 0.37396982 0.29326208 0.007089168
RunsCount 0.917356742 1.00000000 0.980981751 0.98061258 0.96143991 0.066939637 -0.41574859 0.39599554 0.28467829 0.025444465
HitsCount 0.928854838 0.98098175 1.000000000 0.99285909 0.96655549 0.094965435 -0.40345947 0.40935239 0.28381116 0.009406063
AtBatCount 0.933999124 0.98061258 0.992859092 1.00000000 0.96249205 0.104071106 -0.41312302 0.42088650 0.28505279 0.022103148
RunsBattedInCount 0.906158885 0.96143991 0.966555492 0.96249205 1.00000000 0.146342816 -0.33078136 0.42625081 0.34887083 0.080750210
Weight 0.049338900 0.06693964 0.094965435 0.10407111 0.14634282 1.000000000 0.36519581 0.23902965 0.15470397 0.067977601
Height -0.383773700 -0.41574859 -0.403459467 -0.41312302 -0.33078136 0.365195815 1.00000000 -0.01158694 0.00508644 -0.065589083
Salary 0.373969816 0.39599554 0.409352391 0.42088650 0.42625081 0.239029646 -0.01158694 1.00000000 0.66313012 0.408074570
CareerLength 0.293262076 0.28467829 0.283811164 0.28505279 0.34887083 0.154703969 0.00508644 0.66313012 1.00000000 0.700655887
Age 0.007089168 0.02544446 0.009406063 0.02210315 0.08075021 0.067977601 -0.06558908 0.40807457 0.70065589 1.000000000
HitInd 0.524298779 0.60139102 0.590856611 0.62772750 0.56348767 0.002645485 -0.40325001 0.41314184 0.32002283 0.189403193
HitInd
GamesCount 0.524298779
RunsCount 0.601391017
HitsCount 0.590856611
AtBatCount 0.627727500
RunsBattedInCount 0.563487671
Weight 0.002645485
Height -0.403250011
Salary 0.413141838
CareerLength 0.320022828
Age 0.189403193
HitInd 1.000000000
As part of visualising the data in this new subset, a matrix of scatterplots with all of the different attributes could be looked at to gather further insights.
# Visualising a matrix of scatterplots (for all of the different attributes).
pairs(myModellingSubset,panel=panel.smooth)
Upon looking at the correlations between the attributes, it was found that ‘GamesCount’, ‘RunsCount’, ‘HitsCount’, ‘AtBatCount’, and ‘RunsBattedInCount’ were highly correlated (as previously discovered as well). Due to this multicollinearity, it was decided that only one of these explanatory variables would be used in the model. While not as significant, there was some correlation found between the ‘CareerLength’ and ‘Age’ attributes as well.
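As a sanity check, highly correlated pairs can also be flagged programmatically. The sketch below uses a small synthetic data frame (not the baseball subset) to show the idea; the 0.9 threshold and the variable names are illustrative assumptions.

```r
# A minimal sketch (on a small synthetic data frame, not the baseball data)
# showing how pairs of explanatory variables with |correlation| > 0.9
# can be flagged automatically before modelling.
set.seed(1)
x1 <- rnorm(50)
df <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(50, sd = 0.1),  # nearly collinear with x1
                 x3 = rnorm(50))
corMat <- cor(df)
# Keep only the upper triangle so each pair is reported once.
corMat[lower.tri(corMat, diag = TRUE)] <- NA
highPairs <- which(abs(corMat) > 0.9, arr.ind = TRUE)
data.frame(var1 = rownames(corMat)[highPairs[, "row"]],
           var2 = colnames(corMat)[highPairs[, "col"]])
```

Only one of each flagged pair would then be kept for modelling, as was done above with the count attributes.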
The correlation between salary and all of the highly correlated variables (‘GamesCount’, ‘RunsCount’, ‘HitsCount’, ‘AtBatCount’, and ‘RunsBattedInCount’) can be calculated separately as well – to determine which attribute to use in the model. The lines of code used for this have been commented out since the previously used ‘cor’ function calculated the correlation between all of the attributes.
# Calculating the correlation between Salary & GamesCount.
#cor.test(myModellingSubset$Salary, myModellingSubset$GamesCount)
# Calculating the correlation between Salary & RunsCount.
#cor.test(myModellingSubset$Salary, myModellingSubset$RunsCount)
# Calculating the correlation between Salary & HitsCount.
#cor.test(myModellingSubset$Salary, myModellingSubset$HitsCount)
# Calculating the correlation between Salary & AtBatCount.
#cor.test(myModellingSubset$Salary, myModellingSubset$AtBatCount)
# Calculating the correlation between Salary & RunsBattedInCount.
#cor.test(myModellingSubset$Salary, myModellingSubset$RunsBattedInCount)
It was decided that ‘RunsBattedInCount’ would be used in the initial model. If an appropriate model is not found, a model using a different attribute could be built later on.
As the initial step, all of the continuous explanatory variables would be used to build a maximal model for the baseball dataset.
# Building a maximal model for the baseball dataset.
baseballMaxModel <- lm(Salary~RunsBattedInCount+Weight+Height+CareerLength+Age+HitInd, data=myModellingSubset)
# Visualising a summary of the model created.
summary(baseballMaxModel)
Call:
lm(formula = Salary ~ RunsBattedInCount + Weight + Height + CareerLength +
Age + HitInd, data = myModellingSubset)
Residuals:
Min 1Q Median 3Q Max
-8773088 -2681172 -767468 1558104 13677268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21603272 21498103 -1.005 0.3184
RunsBattedInCount 30258 26487 1.142 0.2571
Weight 33876 28045 1.208 0.2311
Height 169288 287055 0.590 0.5572
CareerLength 999586 231124 4.325 4.89e-05 ***
Age -66053 201849 -0.327 0.7444
HitInd 2513527 1389314 1.809 0.0747 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4746000 on 71 degrees of freedom
Multiple R-squared: 0.5197, Adjusted R-squared: 0.4792
F-statistic: 12.81 on 6 and 71 DF, p-value: 9.575e-10
Having built an initial model, the model characteristics (including the goodness of fit) and relevant diagnostics can be looked at to critique the model.
Usually after building a model, the model characteristics (including the goodness of fit) can be looked at to understand the model better. This can be done by using the ‘summary’ function, which produces insights about the model. While the ‘summary’ function produces a number of statistics relating to the model, a few main values need to be looked at.
In addition to looking at different statistics, diagnostic plots (Portugués, 2020) can be looked at to gather further insights about the model.
This can be done by using the ‘plot’ function, which usually produces four graphs: Residuals vs Fitted (to check the linearity assumption), Normal Q-Q (to check the normality of the residuals), Scale-Location (to check the constant-variance assumption), and Residuals vs Leverage (to identify influential points).
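As a minimal illustration of these plots (using a small synthetic dataset rather than the baseball data), all four graphs can be displayed together in a 2×2 grid:

```r
# A minimal sketch (synthetic data) showing how all four diagnostic plots
# produced by 'plot' on an lm object can be displayed in a 2x2 grid.
set.seed(2)
toy <- data.frame(x = 1:30, y = 2 * (1:30) + rnorm(30, sd = 3))
toyFit <- lm(y ~ x, data = toy)
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(toyFit)
par(mfrow = c(1, 1))  # restore the default single-plot layout
```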
To start investigating the model further, the output of the ‘summary’ function shown above can be used.
The results produced are as follows:
Using these results, it can be said that this model has an F-statistic and an r-squared value that are considered to be fine (but not great). Further, some of the attributes in the model are clearly not significant, indicating that the model needs to be simplified.
The ‘plot’ function can be used to look at the diagnostic plots.
# Visualising the diagnostic plots of the model.
plot(baseballMaxModel)
Usually, the diagnostic plots are looked at only after an appropriate model is reached. Therefore, analysis of these plots in order to critique the model will be done later in the section.
Next, the ‘step’ function can be used to reach a minimal adequate model. The aim of this is to use a minimal number of attributes that have a significant impact on the model. Before using the ‘step’ function, the terms deviance and AIC need to be explained, since the ‘step’ function produces these statistics.
Deviance is a measure of model fit (the lack of model fit, to be exact), and it depends on the error structure and link function. Due to this, a small deviance means that the model explains a lot of the variability in the data, which is what is wanted.
AIC (Akaike Information Criterion) is another measure of model fit that takes the number of parameters into account, and is found using: \[AIC = deviance + 2p\] where p is the number of parameters. Similar to the deviance, a low AIC is what is ideally expected.
A model with a lot of parameters could have a low deviance; however, because AIC takes the number of parameters into account, such a model will not necessarily have a low AIC. AIC is important since adding parameters increases the complexity, and this increase is not reflected in the deviance value.
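This trade-off can be demonstrated with a small synthetic example (the variable names are illustrative): adding useless parameters always lowers the deviance (RSS), while AIC penalises the added complexity.

```r
# A minimal sketch (synthetic data) showing that adding parameters always
# lowers the residual deviance, while AIC penalises the extra complexity.
set.seed(3)
n <- 40
d <- data.frame(x = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
small <- lm(y ~ x, data = d)
big   <- lm(y ~ x + noise1 + noise2, data = d)  # extra useless parameters
deviance(small); deviance(big)  # deviance (RSS) always drops or stays equal
AIC(small); AIC(big)            # AIC can rise when the extra terms add nothing
```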
# Using the ‘step’ function to find a minimal adequate model.
step(baseballMaxModel)
Start: AIC=2404.82
Salary ~ RunsBattedInCount + Weight + Height + CareerLength +
Age + HitInd
Df Sum of Sq RSS AIC
- Age 1 2.4120e+12 1.6016e+15 2402.9
- Height 1 7.8337e+12 1.6070e+15 2403.2
- RunsBattedInCount 1 2.9394e+13 1.6286e+15 2404.2
- Weight 1 3.2864e+13 1.6321e+15 2404.4
<none> 1.5992e+15 2404.8
- HitInd 1 7.3725e+13 1.6729e+15 2406.3
- CareerLength 1 4.2130e+14 2.0205e+15 2421.1
Step: AIC=2402.94
Salary ~ RunsBattedInCount + Weight + Height + CareerLength +
HitInd
Df Sum of Sq RSS AIC
- Height 1 9.9793e+12 1.6116e+15 2401.4
- Weight 1 3.1909e+13 1.6335e+15 2402.5
- RunsBattedInCount 1 3.7654e+13 1.6393e+15 2402.8
<none> 1.6016e+15 2402.9
- HitInd 1 7.3079e+13 1.6747e+15 2404.4
- CareerLength 1 7.8851e+14 2.3901e+15 2432.2
Step: AIC=2401.42
Salary ~ RunsBattedInCount + Weight + CareerLength + HitInd
Df Sum of Sq RSS AIC
- RunsBattedInCount 1 3.0399e+13 1.6420e+15 2400.9
<none> 1.6116e+15 2401.4
- Weight 1 5.9258e+13 1.6709e+15 2402.2
- HitInd 1 6.3659e+13 1.6753e+15 2402.4
- CareerLength 1 8.3417e+14 2.4458e+15 2432.0
Step: AIC=2400.88
Salary ~ Weight + CareerLength + HitInd
Df Sum of Sq RSS AIC
<none> 1.6420e+15 2400.9
- Weight 1 7.3836e+13 1.7158e+15 2402.3
- HitInd 1 1.6010e+14 1.8021e+15 2406.1
- CareerLength 1 9.3101e+14 2.5730e+15 2433.9
Call:
lm(formula = Salary ~ Weight + CareerLength + HitInd, data = myModellingSubset)
Coefficients:
(Intercept) Weight CareerLength HitInd
-13866939 45469 996156 3037167
The results from the step function can be used to build the minimal adequate model.
# Building a minimal adequate model.
baseballMinModel <- lm(Salary~Weight+CareerLength+HitInd, data=myModellingSubset)
# Visualising a summary of the model created.
summary(baseballMinModel)
Call:
lm(formula = Salary ~ Weight + CareerLength + HitInd, data = myModellingSubset)
Residuals:
Min 1Q Median 3Q Max
-10139838 -2546424 -861654 1503541 13363942
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13866939 5518793 -2.513 0.01416 *
Weight 45469 24926 1.824 0.07216 .
CareerLength 996156 153787 6.478 9.11e-09 ***
HitInd 3037167 1130704 2.686 0.00892 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4711000 on 74 degrees of freedom
Multiple R-squared: 0.5069, Adjusted R-squared: 0.4869
F-statistic: 25.36 on 3 and 74 DF, p-value: 2.176e-11
The results produced are as follows:
While this model is significantly simpler than the previous model, not all of the attributes in this model are significant. However, the F-statistic and r-squared values seem to be fine, so it was decided that further inspection could be carried out on the model (i.e. looking at the diagnostic plots).
Since the weight in the ‘baseballMinModel’ was not significant, a new model was built with the weight removed (the code used for this has been commented out). However, it was found that removing it reduced the goodness of fit and the percentage of the variance explained by the model. Therefore, it was decided that the previous model would be left as it is.
# Building a minimal adequate model.
#baseballMinModel02 <- update(baseballMinModel,~.-Weight)
# Visualising a summary of the model created.
#summary(baseballMinModel02)
It was also clear that the binary explanatory variable (‘HitInd’) added to the ‘baseballMaxModel’ did not add any significance to the model. Therefore, a new model can be created without the binary explanatory variable. This can be done using a number of different methods, two of which have been demonstrated (with one method commented out).
# Creating new model - after removing one variable.
baseballMaxModel02 <- update(baseballMaxModel,~.-HitInd)
# Alternative method of creating the same model:
# Building a model for the baseball dataset - without the binary explanatory variable.
#baseballMaxModel02 <- lm(Salary~RunsBattedInCount+Weight+Height+CareerLength+Age, data=myModellingSubset)
# Visualising a summary of the model created.
summary(baseballMaxModel02)
Call:
lm(formula = Salary ~ RunsBattedInCount + Weight + Height + CareerLength +
Age, data = myModellingSubset)
Residuals:
Min 1Q Median 3Q Max
-9445814 -2373626 -767028 1049039 15343185
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8298528 20517580 -0.404 0.6871
RunsBattedInCount 49256 24697 1.994 0.0499 *
Weight 34712 28480 1.219 0.2269
Height 26429 280304 0.094 0.9251
CareerLength 1051203 232949 4.513 2.44e-05 ***
Age -56499 204940 -0.276 0.7836
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4820000 on 72 degrees of freedom
Multiple R-squared: 0.4976, Adjusted R-squared: 0.4627
F-statistic: 14.26 on 5 and 72 DF, p-value: 1.081e-09
The results produced are as follows:
Upon removing one variable from the model, it is important to understand whether dropping this variable led to a significant difference in the main statistics (F-ratio, goodness of fit (r²), significance of the coefficients, and residual standard error). To check this, all of the statistics can be looked at side by side:
It is clear that there is no significant difference in any of the statistics, confirming that removing the binary explanatory variable was appropriate. This is because the aim is to have a minimum number of variables that significantly impact the model.
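A side-by-side comparison of this kind can also be produced programmatically. The sketch below uses synthetic data and hypothetical model names to show one way of tabulating the F-statistic, r-squared values, and residual standard error of two models:

```r
# A minimal sketch (synthetic data, hypothetical model names) showing how the
# key statistics of two fitted models can be tabulated side by side.
set.seed(4)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60), z = rbinom(60, 1, 0.5))
d$y <- 3 * d$x1 + rnorm(60)
fullFit    <- lm(y ~ x1 + x2 + z, data = d)
reducedFit <- lm(y ~ x1 + x2, data = d)
compareFit <- function(fit) {
  s <- summary(fit)
  c(Fstat   = unname(s$fstatistic["value"]),
    R2      = s$r.squared,
    AdjR2   = s$adj.r.squared,
    ResidSE = s$sigma)
}
# One column per model, one row per statistic.
round(sapply(list(full = fullFit, reduced = reducedFit), compareFit), 3)
```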
It should be noted that there is another way in which a model could be built, taking into account the different interactions between the explanatory variables. A regression tree can be built to check for interactions in the dataset. Two methods for building the tree model have been demonstrated, with the second method (which is equivalent here only because all of the attributes are included in the model) commented out.
# Building a model to check for interactions.
baseballTreeModel<-tree(Salary~GamesCount+RunsCount+HitsCount+AtBatCount+RunsBattedInCount+Weight+Height+CareerLength+Age+HitInd,data=myModellingSubset)
# Building a model to check for interactions.
#baseballTreeModel<-tree(Salary~.,data=myModellingSubset)
# Plotting the tree model.
plot(baseballTreeModel)
text(baseballTreeModel)
Using the tree model, it is clear that career length is the attribute most significantly affecting the salary. It is also clear that games count and age are the two other attributes that affect the salary.
Seeing as there are only three attributes affecting the salary, it was decided that the most complex version of a model including these three attributes (with quadratic terms for each) could be built.
# Building a model with quadratic terms for the attributes identified by the tree.
baseballTreeModel <- lm(Salary~GamesCount+CareerLength+Age+I(GamesCount^2)+I(CareerLength^2)+I(Age^2), data=myModellingSubset)
# Visualising a summary of the model created.
summary(baseballTreeModel)
Call:
lm(formula = Salary ~ GamesCount + CareerLength + Age + I(GamesCount^2) +
I(CareerLength^2) + I(Age^2), data = myModellingSubset)
Residuals:
Min 1Q Median 3Q Max
-8509590 -2662398 -499936 1736917 14761721
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.965e+07 2.858e+07 -1.037 0.30314
GamesCount -4.124e+04 4.794e+04 -0.860 0.39259
CareerLength 1.610e+06 4.870e+05 3.305 0.00149 **
Age 1.965e+06 1.951e+06 1.008 0.31708
I(GamesCount^2) 5.068e+02 3.096e+02 1.637 0.10602
I(CareerLength^2) -4.141e+04 3.294e+04 -1.257 0.21288
I(Age^2) -3.333e+04 3.241e+04 -1.028 0.30725
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4748000 on 71 degrees of freedom
Multiple R-squared: 0.5192, Adjusted R-squared: 0.4786
F-statistic: 12.78 on 6 and 71 DF, p-value: 9.928e-10
The results produced are as follows:
While this model does contain attributes that significantly affect it, a number of attributes do not. Therefore, it was decided that this model could be simplified further using the ‘step’ function.
# Using the ‘step’ function to find a minimal adequate model.
step(baseballTreeModel)
Start: AIC=2404.91
Salary ~ GamesCount + CareerLength + Age + I(GamesCount^2) +
I(CareerLength^2) + I(Age^2)
Df Sum of Sq RSS AIC
- GamesCount 1 1.6683e+13 1.6176e+15 2403.7
- Age 1 2.2891e+13 1.6238e+15 2404.0
- I(Age^2) 1 2.3847e+13 1.6248e+15 2404.1
- I(CareerLength^2) 1 3.5626e+13 1.6365e+15 2404.6
<none> 1.6009e+15 2404.9
- I(GamesCount^2) 1 6.0435e+13 1.6614e+15 2405.8
- CareerLength 1 2.4635e+14 1.8473e+15 2414.1
Step: AIC=2403.72
Salary ~ CareerLength + Age + I(GamesCount^2) + I(CareerLength^2) +
I(Age^2)
Df Sum of Sq RSS AIC
- Age 1 2.2494e+13 1.6401e+15 2402.8
- I(Age^2) 1 2.3657e+13 1.6413e+15 2402.8
- I(CareerLength^2) 1 2.8508e+13 1.6461e+15 2403.1
<none> 1.6176e+15 2403.7
- I(GamesCount^2) 1 1.8174e+14 1.7993e+15 2410.0
- CareerLength 1 2.3227e+14 1.8499e+15 2412.2
Step: AIC=2402.79
Salary ~ CareerLength + I(GamesCount^2) + I(CareerLength^2) +
I(Age^2)
Df Sum of Sq RSS AIC
- I(Age^2) 1 2.0077e+12 1.6421e+15 2400.9
<none> 1.6401e+15 2402.8
- I(CareerLength^2) 1 8.0834e+13 1.7209e+15 2404.5
- I(GamesCount^2) 1 1.6450e+14 1.8046e+15 2408.2
- CareerLength 1 4.2126e+14 2.0614e+15 2418.6
Step: AIC=2400.89
Salary ~ CareerLength + I(GamesCount^2) + I(CareerLength^2)
Df Sum of Sq RSS AIC
<none> 1.6421e+15 2400.9
- I(CareerLength^2) 1 8.2469e+13 1.7246e+15 2402.7
- I(GamesCount^2) 1 1.9115e+14 1.8333e+15 2407.5
- CareerLength 1 4.5528e+14 2.0974e+15 2418.0
Call:
lm(formula = Salary ~ CareerLength + I(GamesCount^2) + I(CareerLength^2),
data = myModellingSubset)
Coefficients:
(Intercept) CareerLength I(GamesCount^2) I(CareerLength^2)
-2326806.8 1745543.4 242.8 -53376.1
The minimal adequate model found using the step function could now be built.
# Building a model using the interactions.
baseballTreeStepModel <- lm(Salary~CareerLength+I(GamesCount^2)+I(CareerLength^2), data=myModellingSubset)
# Visualising a summary of the model created.
summary(baseballTreeStepModel)
Call:
lm(formula = Salary ~ CareerLength + I(GamesCount^2) + I(CareerLength^2),
data = myModellingSubset)
Residuals:
Min 1Q Median 3Q Max
-9081542 -2839364 -839553 2035467 14694310
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.327e+06 1.181e+06 -1.971 0.05251 .
CareerLength 1.746e+06 3.854e+05 4.530 2.22e-05 ***
I(GamesCount^2) 2.428e+02 8.274e+01 2.935 0.00444 **
I(CareerLength^2) -5.338e+04 2.769e+04 -1.928 0.05772 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4711000 on 74 degrees of freedom
Multiple R-squared: 0.5069, Adjusted R-squared: 0.4869
F-statistic: 25.35 on 3 and 74 DF, p-value: 2.181e-11
The results produced are as follows:
Since this model has an F-statistic and r-squared value that could be considered ‘fair’, and all of the attributes in this model are significant, it was decided that this model would be explored further.
On the whole, five main models were built in this section.
Model building is usually a sequential process where one model is critiqued, then improved, then critiqued again, until an appropriate model is produced. Therefore, there is usually only one model for which diagnostic plots are used. However, since both models were seen as being ‘appropriate’, diagnostic plots for both can be visualised to critique them further.
The previously given description of the ‘plot’ function will be used to interpret these diagnostic plots.
# Visualising the diagnostic plots of the model.
plot(baseballMinModel)
The results produced are as follows:
While this model has a good F-statistic and r-squared value, the plots show that the model is ‘fine’ (with the possible exception of a violation of the linearity assumption). It was also found that there were a few outliers (points 24, 28, 29, 39, and 59).
The coefficients of this model can be looked at to generate the equation for this model.
# Visualising the model coefficients.
coef(baseballMinModel)
(Intercept) Weight CareerLength HitInd
-13866939.38 45469.37 996155.95 3037166.97
Using the coefficients, the equation for the model would be as follows.
\[Salary = 45469.37*Weight + 996155.95*Career\ Length + 3037166.97*Hit\ Ind - 13866939.38\]
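Rather than substituting values into this equation by hand, the ‘predict’ function applies the fitted coefficients to new observations. The sketch below uses synthetic data (with hypothetical values for a new player) to illustrate this:

```r
# A minimal sketch (synthetic data, hypothetical player values) showing how
# 'predict' applies a fitted model's equation to new observations.
set.seed(5)
d <- data.frame(Weight = rnorm(50, 210, 15), CareerLength = runif(50, 0, 15))
d$Salary <- 40000 * d$Weight + 1e6 * d$CareerLength + rnorm(50, sd = 5e5)
fit <- lm(Salary ~ Weight + CareerLength, data = d)
newPlayer <- data.frame(Weight = 200, CareerLength = 5)
# Equivalent to substituting the new values into the model equation.
predict(fit, newdata = newPlayer)
```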
The diagnostics plots of the ‘baseballTreeStepModel’ can also be looked at.
# Visualising the diagnostic plots of the model.
plot(baseballTreeStepModel)
The results produced are as follows:
This model has a good F-statistic and r-squared value, and the plots show that the model is ‘fine’. When comparing these findings to the previous model, it could also be thought that this model is ‘better’. In addition, it was found that there were a few outliers (points 10, 28, 29, and 39). The fact that some of these points were found to be outliers in both models confirms that there is indeed a problem with a few of them.
The coefficients of this model can be looked at to generate the equation for this model.
# Visualising the model coefficients.
coef(baseballTreeStepModel)
(Intercept) CareerLength I(GamesCount^2) I(CareerLength^2)
-2326806.8266 1745543.4346 242.8344 -53376.1438
Using the coefficients, the equation for the model would be as follows.
\[Salary = 1745543.43*Career\ Length + 242.83*Games\ Count^{2} - 53376.14*Career\ Length^{2} -2326806.83\]
To reiterate, the model building process is usually done sequentially; however, the plots of two different models were looked at together in this instance because two different approaches were used to build them. This also provided the chance to demonstrate how different diagnostic plots could be interpreted.
To conclude, it was decided that the ‘baseballTreeStepModel’ (with the equation \(Salary = 1745543.43*Career\ Length + 242.83*Games\ Count^{2} - 53376.14*Career\ Length^{2} -2326806.83\)) was the best model as it contained a good F score, good r squared value, and the diagnostic plots showed that this model was appropriate. However, this model could also be improved further, more information for which can be found in the next section.
When a model is deemed not to be appropriate for the given research question, regardless of whether linear regression or multiple regression was used, model transformation is thought to be the best solution. When transforming a model, there are a number of different options that can be used.
An example can be used to demonstrate how model transformation usually works. If a linear regression model is in the form “y = a + bx” and is seen as not being appropriate, a transformation could be carried out where ‘log x’ or ‘log y’ is used instead of x or y respectively.
Log X: \[y = a\ +\ b \times \log x\]
Log Y: \[\log y = a\ +\ bx\], which could also be written as \[y = \exp(a\ +\ bx)\]
This is done in the hope that the newly transformed model is better suited to the given research question. It should be noted that while the steps for model transformation are not set in stone and are usually a case of trial and error, the model characteristics, goodness of fit, and diagnostic plots give some insight as to which transformation needs to be applied.
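As a minimal illustration of these two transformations (on synthetic positive-valued data, where the true relationship is exponential, so the Log Y form should fit well):

```r
# A minimal sketch (synthetic positive-valued data) contrasting a Log X and a
# Log Y transformation of a simple linear model.
set.seed(6)
x <- runif(80, 1, 100)
y <- exp(0.5 + 0.02 * x + rnorm(80, sd = 0.2))  # truly exponential relationship
plainFit <- lm(y ~ x)
logXFit  <- lm(y ~ log(x))   # y = a + b*log(x)
logYFit  <- lm(log(y) ~ x)   # log(y) = a + b*x, i.e. y = exp(a + b*x)
# The transformation matching the data-generating process recovers the slope:
coef(logYFit)
```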
This information can now be used to suggest improvements or an alternate approach to model building to address the findings from the previous sections.
Using the insights found during the process of critiquing the model, an appropriate transformation can be applied.
From the insights gained from the diagnostic plots, the Log X transformation could be appropriate for this model. The code that can be used to carry out this transformation on the ‘baseballTreeStepModel’ can be found below. After carrying out the transformation, the summary of the model is looked at (i.e. the F-statistic and r-squared values, similar to before), and the model is then critiqued and improved as needed. The diagnostic plots can be visualised again at this stage to provide insight into how any further transformation needs to be carried out.
# Building a log-transformed version of the model.
baseballTreeLogModel <- lm(Salary~log(CareerLength)+log(I(GamesCount^2))+log(I(CareerLength^2)), data=myModellingSubset)
# Visualising a summary of the model created.
summary(baseballTreeLogModel)
Call:
lm(formula = Salary ~ log(CareerLength) + log(I(GamesCount^2)) +
log(I(CareerLength^2)), data = myModellingSubset)
Residuals:
Min 1Q Median 3Q Max
-7396119 -4258617 -1448533 3119805 16315864
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2277132 2139775 -1.064 0.2907
log(CareerLength) 3098544 591965 5.234 1.46e-06 ***
log(I(GamesCount^2)) 559241 302287 1.850 0.0682 .
log(I(CareerLength^2)) NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5407000 on 75 degrees of freedom
Multiple R-squared: 0.3416, Adjusted R-squared: 0.324
F-statistic: 19.45 on 2 and 75 DF, p-value: 1.563e-07
As previously stated, this model was built to demonstrate how a transformation could be carried out. Note that the coefficient for ‘log(I(CareerLength^2))’ is undefined because log(CareerLength²) = 2 × log(CareerLength), making it perfectly collinear with ‘log(CareerLength)’; this is the cause of the ‘singularities’ message in the summary.
Alternately, another approach could also be used to build the model.
Another approach would be to use a generalised additive model (Ross, 2020), which can be used to understand complex datasets. When building models, there are several factors to consider, such as flexibility and interpretability. Linear models are easy to interpret (i.e. simple) even though they do not provide a lot of flexibility. On the other hand, machine learning models are flexible but are not simple to interpret. Further, machine learning models require ‘huge’ datasets for them to work, which are sometimes not available. Generalised additive models (Ross, 2020) are seen as the middle ground, providing both flexibility and interpretability using datasets that are not ‘huge’. A generalised additive model can be built using the attributes in the ‘baseballTreeStepModel’, with the code used for this shown below. Similar to before, this model can also be critiqued and improved until an appropriate model is reached.
# Building a generalised additive model.
baseballGAMTreeModel <- gam(Salary~CareerLength+I(GamesCount^2)+I(CareerLength^2), data=myModellingSubset)
# Visualising a summary of the model created.
summary(baseballGAMTreeModel)
Family: gaussian
Link function: identity
Formula:
Salary ~ CareerLength + I(GamesCount^2) + I(CareerLength^2)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.327e+06 1.181e+06 -1.971 0.05251 .
CareerLength 1.746e+06 3.854e+05 4.530 2.22e-05 ***
I(GamesCount^2) 2.428e+02 8.274e+01 2.935 0.00444 **
I(CareerLength^2) -5.338e+04 2.769e+04 -1.928 0.05772 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.487 Deviance explained = 50.7%
GCV = 2.339e+13 Scale est. = 2.2191e+13 n = 78
Looking at these statistics, it can be said that this is a ‘good’ model, with a fair r-squared value and a large proportion of the variance explained.
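For completeness, the flexibility that generalised additive models are known for usually comes from smooth terms, which were not used in the model above. The sketch below (on synthetic data, not the baseball subset) shows how ‘s()’ fits a spline to a clearly non-linear relationship:

```r
# A minimal sketch (synthetic data) of the smooth terms that give GAMs their
# flexibility; s() fits a spline instead of a straight line. Assumes 'mgcv'
# is loaded, as above.
library(mgcv)
set.seed(7)
d <- data.frame(x = runif(100, 0, 10))
d$y <- sin(d$x) + rnorm(100, sd = 0.2)      # clearly non-linear relationship
smoothFit <- gam(y ~ s(x), data = d)        # spline smooth of x
summary(smoothFit)$r.sq                     # adjusted r-squared of the smooth fit
```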
It should be noted that model building and suggesting improvements to a given model involve some trial and error. It should also be understood that a model can always be critiqued and, in turn, expected to be improved. However, a conscious decision needs to be made to stop the process of critiquing and improving a model once an adequate solution is reached. For example, for the research question given above, any of the models (Log X, Log Y, or the generalised additive model) could have ‘problems’, but the model building process needs to stop once an appropriate model has been reached.
In conclusion, both the ‘baseballTreeStepModel’ (the last model in the previous section) and the ‘baseballGAMTreeModel’ are seen to be suitable, and either could be used depending on the situation.
When it comes to building any model with a binary target attribute (i.e. the dependent variable is binary), logistic regression (Le, 2018) should be the type of model used. It should be noted that logistic regression is a type of generalised linear model that is used to find the probability of a certain class/event occurring or not (i.e. winning/losing, passing/failing, getting accepted/rejected). The explanatory variables can take various forms, but the dependent variable always has exactly two outcomes (i.e. is binary).
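As a minimal sketch of what such a model looks like in R (using synthetic data and a hypothetical binary outcome named ‘hit’, not the actual modelling subset):

```r
# A minimal sketch (synthetic data, hypothetical variable names) of a logistic
# regression model; the binomial family makes 'glm' model the probability of
# the binary outcome.
set.seed(8)
d <- data.frame(x = rnorm(200))
d$hit <- rbinom(200, 1, plogis(-0.5 + 1.5 * d$x))  # binary target
logitFit <- glm(hit ~ x, family = binomial, data = d)
summary(logitFit)
# Predicted probability of a 'hit' for a new observation:
predict(logitFit, newdata = data.frame(x = 1), type = "response")
```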
When looking at logistic regression, three aspects of the output need to be checked:
A second subset will be created to be used for the logistic regression model.
# Creating a new modelling subset.
myModellingSubset02 <- myUpdatedSubset
When it comes to building a logistic regression model given a research question, a number of steps will need to be followed. Similar to the previous model, the first step usually is to load the data correctly, which has been done already in this instance. The next step usually is to visualise the dataset numerically and graphically – for which the ‘summary’ function will be used. The data from this dataset was already visualised graphically in the previous section (with the findings being reported as well), therefore, further graphical visualisation does not need to be carried out.
The data cleaning and EDA previously carried out gave insights into the different attributes of the dataset. This also showed the correlations between some of the variables, making it possible to only include one highly correlated variable in the model. EDA carried out also gave further insights into the target attribute (HitInd) when compared to all of the other attributes. The exploratory data analysis carried out previously showed that the count of games, runs, hits, at bats, and runs batted in, as well as height, salary, career length, and age, might all have an effect on a player’s ability to score a hit; therefore, it was decided that all of these attributes would be included in the initial model.
Since ‘HitInd’ is the dependent variable, the correlation between ‘HitInd’ and all of the highly correlated variables (‘GamesCount’, ‘RunsCount’, ‘HitsCount’, ‘AtBatCount’, and ‘RunsBattedInCount’) can be calculated separately to determine which attribute to use in the model.
As ‘HitInd’ is a factor, it first needs to be converted into a double before the correlation between ‘HitInd’ and the other attributes can be found.
# Changing data type of HitInd.
myModellingSubset02$HitInd <- as.double(myModellingSubset02$HitInd)
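One caveat worth noting: applying `as.double` directly to a factor returns the internal level codes (1, 2, …) rather than the original labels. When the levels are the strings "0" and "1", converting via `as.character` first recovers the original values. The correlations computed below are unaffected, since correlation is invariant to a shift by one, but the distinction matters whenever the numeric values themselves are used:

```r
# A factor's underlying codes start at 1, regardless of the labels.
x <- factor(c("0", "1", "1", "0"))
as.double(x)                 # level codes: 1 2 2 1
as.double(as.character(x))   # original values: 0 1 1 0
```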
The correlation between ‘HitInd’ and the other attributes can now be found.
# Calculating the correlation between HitInd & GamesCount.
cor.test(myModellingSubset02$HitInd, myModellingSubset02$GamesCount)
Pearson's product-moment correlation
data: myModellingSubset02$HitInd and myModellingSubset02$GamesCount
t = 5.3676, df = 76, p-value = 8.361e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3416265 0.6687990
sample estimates:
cor
0.5242988
# Calculating the correlation between HitInd & RunsCount.
cor.test(myModellingSubset02$HitInd, myModellingSubset02$RunsCount)
Pearson's product-moment correlation
data: myModellingSubset02$HitInd and myModellingSubset02$RunsCount
t = 6.5621, df = 76, p-value = 5.798e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4373961 0.7266726
sample estimates:
cor
0.601391
# Calculating the correlation between HitInd & HitsCount.
cor.test(myModellingSubset02$HitInd, myModellingSubset02$HitsCount)
Pearson's product-moment correlation
data: myModellingSubset02$HitInd and myModellingSubset02$HitsCount
t = 6.3846, df = 76, p-value = 1.236e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4240863 0.7188679
sample estimates:
cor
0.5908566
# Calculating the correlation between HitInd & AtBatCount.
cor.test(myModellingSubset02$HitInd, myModellingSubset02$AtBatCount)
Pearson's product-moment correlation
data: myModellingSubset02$HitInd and myModellingSubset02$AtBatCount
t = 7.03, df = 76, p-value = 7.708e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4709885 0.7460443
sample estimates:
cor
0.6277275
# Calculating the correlation between HitInd & RunsBattedInCount.
cor.test(myModellingSubset02$HitInd, myModellingSubset02$RunsBattedInCount)
Pearson's product-moment correlation
data: myModellingSubset02$HitInd and myModellingSubset02$RunsBattedInCount
t = 5.9463, df = 76, p-value = 7.824e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3898403 0.6984388
sample estimates:
cor
0.5634877
Since all of these attributes showed similar correlations with ‘HitInd’, it was decided that ‘GamesCount’ would be used in the model.
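The five correlations above could also be computed in a single call, which makes the comparison easier to read. This is a sketch assuming the column names used throughout this section:

```r
# Correlation of HitInd (as numeric) with each highly correlated candidate.
candidates <- c("GamesCount", "RunsCount", "HitsCount",
                "AtBatCount", "RunsBattedInCount")
sapply(candidates, function(col) {
  cor(as.double(myModellingSubset02$HitInd), myModellingSubset02[[col]])
})
```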
The datatype of ‘HitInd’ will now need to be changed back to factor so as to not affect the rest of the modelling process.
# Changing (back) data type of HitInd.
myModellingSubset02$HitInd <- as.factor(myModellingSubset02$HitInd)
While it was decided that ‘GamesCount’ would be used in the model, the commented-out code below shows how models could have been built with any of the highly correlated attributes (‘GamesCount’, ‘RunsCount’, ‘HitsCount’, ‘AtBatCount’, and ‘RunsBattedInCount’).
# Building a logistic regression model using 'GamesCount'.
#hitIndGamesModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$TeamID+myModellingSubset02$GamesCount+myModellingSubset02$Weight+myModellingSubset02$Height+myModellingSubset02$Salary+myModellingSubset02$Birthdate+myModellingSubset02$CareerLength+myModellingSubset02$Bats+myModellingSubset02$Age, family=binomial)
# Visualising a summary of the model created.
#summary(hitIndGamesModel)
# Building a logistic regression model using 'RunsCount'.
#hitIndRunsModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$TeamID+myModellingSubset02$RunsCount+myModellingSubset02$Weight+myModellingSubset02$Height+myModellingSubset02$Salary+myModellingSubset02$Birthdate+myModellingSubset02$CareerLength+myModellingSubset02$Bats+myModellingSubset02$Age, family=binomial)
# Visualising a summary of the model created.
#summary(hitIndRunsModel)
# Building a logistic regression model using 'HitsCount'.
#hitIndHitsModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$TeamID+myModellingSubset02$HitsCount+myModellingSubset02$Weight+myModellingSubset02$Height+myModellingSubset02$Salary+myModellingSubset02$Birthdate+myModellingSubset02$CareerLength+myModellingSubset02$Bats+myModellingSubset02$Age, family=binomial)
# Visualising a summary of the model created.
#summary(hitIndHitsModel)
# Building a logistic regression model using 'AtBatCount'.
#hitIndAtBatCountModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$TeamID+myModellingSubset02$AtBatCount+myModellingSubset02$Weight+myModellingSubset02$Height+myModellingSubset02$Salary+myModellingSubset02$Birthdate+myModellingSubset02$CareerLength+myModellingSubset02$Bats+myModellingSubset02$Age, family=binomial)
# Visualising a summary of the model created.
#summary(hitIndAtBatCountModel)
# Building a logistic regression model using 'RunsBattedInCount'.
#hitIndRunsBattedInCountModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$TeamID+myModellingSubset02$RunsBattedInCount+myModellingSubset02$Weight+myModellingSubset02$Height+myModellingSubset02$Salary+myModellingSubset02$Birthdate+myModellingSubset02$CareerLength+myModellingSubset02$Bats+myModellingSubset02$Age, family=binomial)
# Visualising a summary of the model created.
#summary(hitIndRunsBattedInCountModel)
A logistic regression model can now be built using ‘GamesCount’ (chosen from the highly correlated attributes) together with all of the other attributes in the dataset.
# Building a logistic regression model.
hitIndModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$TeamID+myModellingSubset02$GamesCount+myModellingSubset02$Weight+myModellingSubset02$Height+myModellingSubset02$Salary+myModellingSubset02$Birthdate+myModellingSubset02$CareerLength+myModellingSubset02$Bats+myModellingSubset02$Age, family=binomial)
# Visualising a summary of the model created.
summary(hitIndModel)
Call:
glm(formula = myModellingSubset02$HitInd ~ myModellingSubset02$TeamID +
myModellingSubset02$GamesCount + myModellingSubset02$Weight +
myModellingSubset02$Height + myModellingSubset02$Salary +
myModellingSubset02$Birthdate + myModellingSubset02$CareerLength +
myModellingSubset02$Bats + myModellingSubset02$Age, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.86177 -0.53137 0.00753 0.43291 2.35514
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.773e+01 1.891e+01 0.938 0.34833
myModellingSubset02$TeamIDTEX 1.182e+00 7.908e-01 1.495 0.13485
myModellingSubset02$GamesCount 5.040e-02 1.908e-02 2.641 0.00826 **
myModellingSubset02$Weight 1.675e-02 2.104e-02 0.796 0.42610
myModellingSubset02$Height -4.340e-01 2.174e-01 -1.997 0.04585 *
myModellingSubset02$Salary 2.678e-07 1.164e-07 2.302 0.02135 *
myModellingSubset02$Birthdate 5.178e-04 5.129e-04 1.010 0.31271
myModellingSubset02$CareerLength 5.002e-02 1.902e-01 0.263 0.79253
myModellingSubset02$BatsL -2.208e+00 1.450e+00 -1.523 0.12778
myModellingSubset02$BatsR -1.596e+00 1.346e+00 -1.186 0.23552
myModellingSubset02$Age 1.840e-01 1.741e-01 1.057 0.29055
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 107.67 on 77 degrees of freedom
Residual deviance: 54.51 on 67 degrees of freedom
AIC: 76.51
Number of Fisher Scoring iterations: 7
From the summary of the model, it is clear that the ‘GamesCount’, ‘Height’, and ‘Salary’ are significant while the other attributes are not.
It should be noted that ‘Odds’ and ‘Odds Ratio’ are concepts discussed as part of logistic regression. Simply put, the odds of an event are the probability that it occurs (Y = 1) divided by the probability that it does not (Y = 0), and can be written as:
\[odds(Y = 1) = \frac{P(Y = 1)}{1 - P(Y = 1)}\]
This can be simplified and represented as: \[odds(Y = 1) = \frac {P(Y = 1)}{P(Y = 0)}\]
The odds ratio extends this idea: it compares the odds of the event [Y] under two conditions of another variable [X], say ‘X=a’ and ‘X=b’. If the odds ratio equals one, the event [Y] is equally likely whether ‘X=a’ or ‘X=b’. If the odds ratio is greater than one, the event [Y] is more likely to occur when ‘X=a’ than when ‘X=b’. If the odds ratio is less than one, the event [Y] is less likely to occur when ‘X=a’ than when ‘X=b’.
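The link between a fitted coefficient and its odds ratio can be made concrete: exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that predictor, and (exp(beta) - 1) * 100 expresses it as a percentage change in the odds. A small illustration with a made-up coefficient value:

```r
# A hypothetical logistic regression coefficient (log-odds scale).
beta <- 0.05

# Odds ratio: multiplicative change in the odds per one-unit increase.
oddsRatio <- exp(beta)

# Equivalent percentage change in the odds.
percentChange <- (oddsRatio - 1) * 100

oddsRatio      # approximately 1.051
percentChange  # approximately 5.1
```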
# Visualising the odds ratios.
exp(coef(hitIndModel))
(Intercept) myModellingSubset02$TeamIDTEX myModellingSubset02$GamesCount myModellingSubset02$Weight
5.020656e+07 3.262475e+00 1.051688e+00 1.016886e+00
myModellingSubset02$Height myModellingSubset02$Salary myModellingSubset02$Birthdate myModellingSubset02$CareerLength
6.479077e-01 1.000000e+00 1.000518e+00 1.051288e+00
myModellingSubset02$BatsL myModellingSubset02$BatsR myModellingSubset02$Age
1.099651e-01 2.026609e-01 1.202054e+00
Looking at these results, a player in the Texas Rangers team is more likely to score a hit than a player in the Boston Red Sox team (the reference level). The number of games a player has played in, a player’s weight, and career length each have odds ratios slightly above one, so an increase in any of them slightly increases the chance of scoring a hit, while the odds ratio for height is below one, so an increase in height slightly lowers that chance. A player batting right-handed has a higher chance of scoring a hit than one batting left-handed, and a one-year increase in a player’s age is associated with an almost 20% increase in the odds of scoring a hit.
Next, the ‘step’ function can be used to simplify the model. The aim is to arrive at a minimal set of attributes that have a significant impact on the model. The model can also be simplified manually, by removing attributes that are not significant and visualising the model again. An attribute can be removed from the model with a line of code such as: “hitIndModel02 <- update(hitIndModel, ~ . - myModellingSubset02$TeamID)”.
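As a sketch of the manual alternative, the least significant attribute can be dropped with `update` and the two fits compared by AIC before deciding whether to continue removing terms. Because the model above was specified with `$` notation, the full term name has to be used in the removal:

```r
# Drop the least significant attribute from the full model and refit.
hitIndModel02 <- update(hitIndModel, ~ . - myModellingSubset02$CareerLength)
summary(hitIndModel02)

# Compare the two fits by AIC (lower is better).
AIC(hitIndModel, hitIndModel02)
```

Repeating this loop by hand reproduces what `step` automates in the output below.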
# Using the ‘step’ function to achieve a minimal adequate model.
step(hitIndModel)
Start: AIC=76.51
myModellingSubset02$HitInd ~ myModellingSubset02$TeamID + myModellingSubset02$GamesCount +
myModellingSubset02$Weight + myModellingSubset02$Height +
myModellingSubset02$Salary + myModellingSubset02$Birthdate +
myModellingSubset02$CareerLength + myModellingSubset02$Bats +
myModellingSubset02$Age
Df Deviance AIC
- myModellingSubset02$CareerLength 1 54.579 74.579
- myModellingSubset02$Weight 1 55.138 75.138
- myModellingSubset02$Bats 2 57.229 75.229
- myModellingSubset02$Birthdate 1 55.599 75.599
- myModellingSubset02$Age 1 55.694 75.694
<none> 54.510 76.510
- myModellingSubset02$TeamID 1 56.940 76.940
- myModellingSubset02$Height 1 59.385 79.385
- myModellingSubset02$Salary 1 63.659 83.659
- myModellingSubset02$GamesCount 1 68.251 88.251
Step: AIC=74.58
myModellingSubset02$HitInd ~ myModellingSubset02$TeamID + myModellingSubset02$GamesCount +
myModellingSubset02$Weight + myModellingSubset02$Height +
myModellingSubset02$Salary + myModellingSubset02$Birthdate +
myModellingSubset02$Bats + myModellingSubset02$Age
Df Deviance AIC
- myModellingSubset02$Weight 1 55.268 73.268
- myModellingSubset02$Bats 2 57.404 73.404
- myModellingSubset02$Birthdate 1 55.625 73.625
- myModellingSubset02$Age 1 55.853 73.853
<none> 54.579 74.579
- myModellingSubset02$TeamID 1 57.216 75.216
- myModellingSubset02$Height 1 59.418 77.418
- myModellingSubset02$Salary 1 67.212 85.212
- myModellingSubset02$GamesCount 1 68.977 86.977
Step: AIC=73.27
myModellingSubset02$HitInd ~ myModellingSubset02$TeamID + myModellingSubset02$GamesCount +
myModellingSubset02$Height + myModellingSubset02$Salary +
myModellingSubset02$Birthdate + myModellingSubset02$Bats +
myModellingSubset02$Age
Df Deviance AIC
- myModellingSubset02$Birthdate 1 56.144 72.144
- myModellingSubset02$Age 1 56.368 72.368
- myModellingSubset02$Bats 2 59.027 73.027
<none> 55.268 73.268
- myModellingSubset02$TeamID 1 57.803 73.803
- myModellingSubset02$Height 1 59.449 75.449
- myModellingSubset02$Salary 1 67.472 83.472
- myModellingSubset02$GamesCount 1 69.738 85.738
Step: AIC=72.14
myModellingSubset02$HitInd ~ myModellingSubset02$TeamID + myModellingSubset02$GamesCount +
myModellingSubset02$Height + myModellingSubset02$Salary +
myModellingSubset02$Bats + myModellingSubset02$Age
Df Deviance AIC
- myModellingSubset02$Age 1 56.391 70.391
- myModellingSubset02$Bats 2 59.729 71.729
<none> 56.144 72.144
- myModellingSubset02$TeamID 1 58.331 72.331
- myModellingSubset02$Height 1 62.416 76.416
- myModellingSubset02$Salary 1 67.696 81.696
- myModellingSubset02$GamesCount 1 69.740 83.740
Step: AIC=70.39
myModellingSubset02$HitInd ~ myModellingSubset02$TeamID + myModellingSubset02$GamesCount +
myModellingSubset02$Height + myModellingSubset02$Salary +
myModellingSubset02$Bats
Df Deviance AIC
- myModellingSubset02$TeamID 1 58.389 70.389
<none> 56.391 70.391
- myModellingSubset02$Bats 2 60.445 70.445
- myModellingSubset02$Height 1 62.994 74.994
- myModellingSubset02$GamesCount 1 69.841 81.841
- myModellingSubset02$Salary 1 70.873 82.873
Step: AIC=70.39
myModellingSubset02$HitInd ~ myModellingSubset02$GamesCount +
myModellingSubset02$Height + myModellingSubset02$Salary +
myModellingSubset02$Bats
Df Deviance AIC
<none> 58.389 70.389
- myModellingSubset02$Bats 2 63.465 71.465
- myModellingSubset02$Height 1 63.568 73.568
- myModellingSubset02$Salary 1 72.305 82.305
- myModellingSubset02$GamesCount 1 73.464 83.464
Call: glm(formula = myModellingSubset02$HitInd ~ myModellingSubset02$GamesCount +
myModellingSubset02$Height + myModellingSubset02$Salary +
myModellingSubset02$Bats, family = binomial)
Coefficients:
(Intercept) myModellingSubset02$GamesCount myModellingSubset02$Height myModellingSubset02$Salary
2.752e+01 4.454e-02 -3.857e-01 2.407e-07
myModellingSubset02$BatsL myModellingSubset02$BatsR
-2.814e+00 -1.777e+00
Degrees of Freedom: 77 Total (i.e. Null); 72 Residual
Null Deviance: 107.7
Residual Deviance: 58.39 AIC: 70.39
The simplified model produced by the step function can now be built.
# Building a logistic regression model - after using the step function.
hitIndStepModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$GamesCount+myModellingSubset02$Height+myModellingSubset02$Salary+myModellingSubset02$Bats, family=binomial)
# Visualising a summary of the model created.
summary(hitIndStepModel)
Call:
glm(formula = myModellingSubset02$HitInd ~ myModellingSubset02$GamesCount +
myModellingSubset02$Height + myModellingSubset02$Salary +
myModellingSubset02$Bats, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.89901 -0.58985 0.01872 0.57706 1.86403
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.752e+01 1.342e+01 2.051 0.04027 *
myModellingSubset02$GamesCount 4.454e-02 1.486e-02 2.997 0.00273 **
myModellingSubset02$Height -3.857e-01 1.854e-01 -2.081 0.03746 *
myModellingSubset02$Salary 2.407e-07 8.317e-08 2.894 0.00381 **
myModellingSubset02$BatsL -2.814e+00 1.394e+00 -2.018 0.04354 *
myModellingSubset02$BatsR -1.777e+00 1.262e+00 -1.408 0.15912
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 107.669 on 77 degrees of freedom
Residual deviance: 58.389 on 72 degrees of freedom
AIC: 70.389
Number of Fisher Scoring iterations: 6
Visualising the summary of this newly built simplified model showed that all of the attributes in it are significant except for ‘Bats’ (the ‘BatsR’ level is not significant). Therefore, the model can be simplified even further to include only ‘GamesCount’, ‘Height’, and ‘Salary’ (removing ‘Bats’). Notably, these were the only three attributes that were significant in the initial model that included all of the attributes.
# Building a simplified logistic regression model - after removing 'Bats'.
hitIndStepUpdatedModel <- glm(myModellingSubset02$HitInd~myModellingSubset02$GamesCount+myModellingSubset02$Height+myModellingSubset02$Salary, family=binomial)
# Visualising a summary of the model created.
summary(hitIndStepUpdatedModel)
Call:
glm(formula = myModellingSubset02$HitInd ~ myModellingSubset02$GamesCount +
myModellingSubset02$Height + myModellingSubset02$Salary,
family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.79706 -0.68190 0.04099 0.56912 1.84012
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.890e+01 1.247e+01 2.318 0.02042 *
myModellingSubset02$GamesCount 3.465e-02 1.292e-02 2.682 0.00733 **
myModellingSubset02$Height -4.241e-01 1.706e-01 -2.487 0.01290 *
myModellingSubset02$Salary 2.174e-07 7.983e-08 2.723 0.00646 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 107.669 on 77 degrees of freedom
Residual deviance: 63.465 on 74 degrees of freedom
AIC: 71.465
Number of Fisher Scoring iterations: 6
As is often the case at this stage of the modelling process, there is now a dilemma: whether to go with the simpler model with the slightly higher AIC or the more complex model with the slightly lower AIC. Since the simpler model is easier to explain, it will be selected: the model that only uses ‘GamesCount’, ‘Height’, and ‘Salary’.
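The trade-off between the two candidate models can also be checked formally. `AIC` compares the information criteria directly, and `anova` with a chi-squared test asks whether the extra ‘Bats’ terms significantly reduce the deviance. This sketch assumes both fitted models from above are still in the workspace:

```r
# Compare the AIC of the two nested models directly.
AIC(hitIndStepModel, hitIndStepUpdatedModel)

# Likelihood ratio test: does adding 'Bats' significantly improve the fit?
anova(hitIndStepUpdatedModel, hitIndStepModel, test = "Chisq")
```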
The odds ratios of this model can also be looked at.
# Visualising the odds ratios.
exp(coef(hitIndStepUpdatedModel))
(Intercept) myModellingSubset02$GamesCount myModellingSubset02$Height myModellingSubset02$Salary
3.572054e+12 1.035253e+00 6.543331e-01 1.000000e+00
This states that with each additional game a player has played, there is an almost 4% (3.5% to be exact) increase in the odds that the player will score a hit. The results also show that the odds ratio for height is below one, so an increase in a player’s height lowers the chance of scoring a hit, while the odds ratio for salary is so close to one that an increase in salary would not noticeably affect the chance of scoring a hit.
# Visualising the odds ratios.
exp(cbind(OR=coef(hitIndStepUpdatedModel), confint(hitIndStepUpdatedModel)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 3.572054e+12 470.7975390 1.421716e+24
myModellingSubset02$GamesCount 1.035253e+00 1.0127767 1.065763e+00
myModellingSubset02$Height 6.543331e-01 0.4531507 8.914412e-01
myModellingSubset02$Salary 1.000000e+00 1.0000001 1.000000e+00
By looking at these results, it is thought that the model could be simplified even further using the following piece of code, which has been commented out.
# Building a logistic regression model.
#hitIndStepUpdatedModel02 <- glm(myModellingSubset02$HitInd~myModellingSubset02$GamesCount+myModellingSubset02$Height, family=binomial)
# Visualising a summary of the model created.
#summary(hitIndStepUpdatedModel02)
While the model could be simplified even further, this was not done, since the model is already simple and removing another attribute would unnecessarily increase the AIC.
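Before moving on, the chosen model can be sanity-checked by turning its fitted probabilities into class predictions and tabulating them against the observed values. This is a sketch, and the 0.5 cut-off is an arbitrary default rather than a tuned threshold:

```r
# Fitted probabilities of HitInd = 1 for each player in the subset.
fittedProbs <- predict(hitIndStepUpdatedModel, type = "response")

# Classify using a 0.5 probability threshold (an arbitrary default).
predictedClass <- ifelse(fittedProbs > 0.5, 1, 0)

# Confusion matrix of predicted versus observed classes.
table(Predicted = predictedClass, Observed = myModellingSubset02$HitInd)
```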
Crawley, M. J., 2015. Statistics: An Introduction Using R. Chichester, England: Wiley.
Elgabry, O., 2019. The Ultimate Guide to Data Cleaning. [Online] Available at: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4 [Accessed 29 December 2020].
ESPN.co.uk, 2020. MLB raising minimum salary for minor leaguers in 2021. [Online] Available at: https://www.espn.co.uk/mlb/story/_/id/28702734/mlb-raising-minimum-salary-minor-leaguers-2021 [Accessed 20 December 2020].
Formplus, 2020. Data Cleaning: Definition, Methods, and Uses in Research. [Online] Available at: https://www.formpl.us/blog/data-cleaning [Accessed 19 November 2020].
Goldberg, J., 2020. Exploratory Data Analysis. [Online] Available at: https://r4ds.had.co.nz/exploratory-data-analysis.html [Accessed 01 December 2020].
Grant, P., 2019. Understanding Multiple Regression. [Online] Available at: https://towardsdatascience.com/understanding-multiple-regression-249b16bde83e [Accessed 02 December 2020].
IBM, 2020. Exploratory Data Analysis. [Online] Available at: https://www.ibm.com/cloud/learn/exploratory-data-analysis [Accessed 18 November 2020].
Kenton, W., 2020. Residual Standard Deviation. [Online] Available at: https://www.investopedia.com/terms/r/residual-standard-deviation.asp [Accessed 29 December 2020].
Le, J., 2018. Logistic Regression in R Tutorial. [Online] Available at: https://www.datacamp.com/community/tutorials/logistic-regression-R [Accessed 20 December 2020].
Mathewson, T., 2019. How young is too young to play professional sports?. [Online] Available at: https://globalsportmatters.com/culture/2019/04/25/how-young-is-too-young-to-play-professional-sports/ [Accessed 18 December 2020].
NHS, 2020. Height and weight chart. [Online] Available at: https://www.nhs.uk/live-well/healthy-weight/height-weight-chart/ [Accessed 16 November 2020].
Portugués, E. G., 2020. Model Diagnostics. [Online] Available at: https://bookdown.org/egarpor/PM-UC3M/glm-diagnostics.html [Accessed 21 December 2020].
Ross, N., 2020. Generalised Additive Models. [Online] Available at: https://noamross.github.io/gams-in-r-course/ [Accessed 20 December 2020].
Rouse, M., 2019. Data Quality. [Online] Available at: https://searchdatamanagement.techtarget.com/definition/data-quality [Accessed 19 November 2020].
Saraswat, M., 2020. Practical Guide to Logistic Regression Analysis in R. [Online] Available at: https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/logistic-regression-analysis-r/tutorial/ [Accessed 29 December 2020].
Sirohi, K., 2018. Simply Explained Logistic Regression with Example in R. [Online] Available at: https://towardsdatascience.com/simply-explained-logistic-regression-with-example-in-r-b919acb1d6b3 [Accessed 03 December 2020].
Thewissen, S., Nolan, B. & Roser, M., 2015. Incomes across the Distribution. [Online] Available at: https://ourworldindata.org/incomes-across-distribution [Accessed 01 December 2020].
Tranmer, M. & Elliot, M., 2008. Multiple Linear Regression. [Online] Available at: https://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/working-papers/2008/2008-19-multiple-linear-regression.pdf [Accessed 20 December 2020].
United Nations, 2000. Glossary of Terms on Statistical Data Editing. United Nations Statistical Commission and Economic Commission for Europe.